

Targeted Nanopore sequencing for phased variant detection

Isac Lee¹, Rachael Workman¹, Josh Wang², Yunfan Fan¹, Winston Timp¹

¹ Johns Hopkins University, Department of Biomedical Engineering

² Agilent Technologies

[Figure: Bioinformatics Pipeline. Box labels: FASTA sequences; BWA sequence alignment; aligned BAM sequences; Sniffles SV detection; SV candidate loci; Nanopolish basecalling correction; SNP detection; SNV candidate loci.]

[Figure: Solution-phase Hybridization-capture. Step labels: genomic DNA; adaptor-tagged DNA; solution-phase hybridization; magnetic bead separation; enriched DNA library; PCR amplification. Component labels: target DNA; biotinylated RNA probe; adaptor; streptavidin-coated magnetic bead.]

Introduction

Large-scale genomic anomalies, known as structural variations (SVs), are pervasive in cancer. Because SVs are large and are frequently flanked by repetitive sequence, short-read sequencing is limited in its ability to detect them. In contrast, long-read sequencing, e.g. nanopore sequencing, can sequence unfragmented or otherwise long strands of DNA, which improves SV detection. Nanopore sequencing has recently been commercialized in the MinION, a sequencer from Oxford Nanopore Technologies (ONT). However, the technology has a relatively low yield (~2 Gb), making it difficult to sequence whole human genome samples at sufficient coverage, especially for heterogeneous samples or rare event detection. To increase the sequencing depth at SV “hotspot” regions, we have adapted solution-phase hybridization capture (SureSelect) to nanopore sequencing, enriching for the p16/CDKN2A and SMAD4 tumor suppressor gene regions, which are known to be frequently disrupted in pancreatic ductal adenocarcinoma (PDAC). Here we present data from this method on control NA12878 DNA and on DNA from a patient-derived PDAC cell line.

Methods

Hybridization probes were designed to capture a 2.5 Mb region encompassing the p16/CDKN2A and SMAD4 tumor suppressor genes. DNA was sheared to ~3 kb, ligated to adapters, and PCR amplified, followed by Blue Pippin size selection before hybridization and streptavidin bead capture. The resulting library was prepared for ONT nanopore sequencing on the Mk1 MinION with R9 pore chemistry. The current signals from nanopore sequencing were basecalled using the Metrichor cloud basecalling system (ONT).

Sequenced reads were aligned to the human reference genome (hg38) using bwa mem -x ont2d. The aligned BAM files were used to 1) detect structural variations using sniffles (https://github.com/fritzsedlazeck/Sniffles) and 2) correct the sequence reads using the raw signal via nanopolish [1] (https://github.com/jts/nanopolish), resulting in calls for both SVs and single nucleotide variations (SNVs) from nanopore data.
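The alignment and SV-calling steps of the bioinformatics pipeline (bwa mem -x ont2d, then sniffles on the aligned BAM) could be scripted roughly as follows. This is only a sketch under stated assumptions: the file names, thread count, and the samtools sorting and indexing steps are illustrative and not taken from the poster, and the sniffles flags follow the version 1 command line (`-m` for the input BAM, `-v` for the output VCF). The nanopolish correction step is left out because the exact invocation used is not given.

```python
import shlex


def build_pipeline_commands(reference, reads, prefix, threads=4):
    """Return shell command strings for alignment, sorting, indexing, and SV calling.

    All arguments are hypothetical placeholders chosen by the caller.
    """
    q = shlex.quote
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.sv.vcf"
    return [
        # Align nanopore reads with the ont2d preset named in the Methods,
        # then coordinate-sort the alignments (sorting is an assumed step).
        f"bwa mem -x ont2d -t {int(threads)} {q(reference)} {q(reads)}"
        f" | samtools sort -o {q(bam)} -",
        # Index the sorted BAM so downstream tools can fetch regions.
        f"samtools index {q(bam)}",
        # Call structural variants (Sniffles v1-style flags, an assumption).
        f"sniffles -m {q(bam)} -v {q(vcf)}",
    ]


# Print the commands instead of running them, so the sketch has no side effects.
for command in build_pipeline_commands("hg38.fa", "reads.fasta", "na12878"):
    print(command)
```

Building the commands as strings and printing them keeps the sketch inspectable; each line could be passed to `subprocess.run(command, shell=True, check=True)` once the tools are installed.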

Acknowledgments

Thanks to Oxford Nanopore for outstanding technical support, Agilent Technologies for working on the targeted enrichment optimization, Jared Simpson for assistance with nanopolish, and Dr. James Eshleman’s group for providing the PDAC cell line DNA.

References

1. Loman, NJ, et al. 2015. “A Complete Bacterial Genome Assembled de Novo Using Only Nanopore Sequencing Data.” bioRxiv. doi:10.1101/015552.

2. Norris, AL, et al. 2015. “Transflip Mutations Produce Deletions in Pancreatic Cancer.” Genes, Chromosomes & Cancer, May. doi:10.1002/gcc.22258.

3. Eberle, MA, et al. 2016. “A Reference Dataset of 5.4 Million Phased Human Variants Validated by Genetic Inheritance from Sequencing a Three-Generation 17-Member Pedigree.” bioRxiv. doi:10.1101/055541.

Discussion

Our targeted sequencing method generated ~20x average coverage of the target regions on a single MinION flowcell, which is sufficient for SV and SNV analysis. With recent improvements in yield, we should be able to multiplex and run multiple samples on one flowcell. Comparing the SVs and SNVs detected in the control NA12878 DNA with control data confirms the validity of this approach, though sensitivity will likely improve with increased sequencing depth. Applying the technique to the PDAC DNA, we confirmed a previously discovered SV and detected a new SV. Nanopolish basecalling correction enables SNV detection from nanopore reads that is comparable to detection from conventional sequencing platforms. Because nanopolish-corrected reads allow the variants to be phased, we can identify patterns of mutation and allele-specificity of mutation along kilobase-scale stretches of the genome. We believe this will be useful for improving variant calling accuracy, detecting rare mutations, and/or deciphering novel relationships between SNVs. We are currently working to extend the read length and to apply our method to more clinical samples.
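To illustrate what phasing variants on long reads means in practice, here is a minimal sketch that decides whether pairs of heterozygous SNVs sit on the same haplotype by majority vote over the reads spanning both sites. It is not the method used by nanopolish or by the authors; the read representation (a dict from position to `"ref"` or `"alt"`), the function name, and the toy positions are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def phase_pairs(reads, het_sites):
    """Label each pair of heterozygous sites as 'cis' or 'trans'.

    reads: list of dicts mapping position -> "ref" or "alt" (hypothetical format).
    het_sites: iterable of heterozygous SNV positions.
    A pair is 'cis' when most spanning reads carry the same allele state
    at both sites, and 'trans' when most carry opposite states.
    """
    votes = {}
    sites = sorted(het_sites)
    for read in reads:
        covered = [pos for pos in sites if pos in read]
        for a, b in combinations(covered, 2):
            label = "cis" if read[a] == read[b] else "trans"
            votes.setdefault((a, b), Counter())[label] += 1
    # Majority vote per pair; ties resolve to the label seen first.
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


# Toy reads with made-up positions, for illustration only.
toy_reads = [
    {100: "alt", 200: "alt"},
    {100: "ref", 200: "ref"},
    {100: "alt", 200: "ref"},
    {100: "alt", 200: "alt", 300: "ref"},
]
print(phase_pairs(toy_reads, [100, 200, 300]))
```

Only reads that cover both sites of a pair contribute a vote, which is why read length limits how far apart two variants can be while still being phased directly.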

Results

Coverage of long-read nanopore (MinION) and short-read Illumina (MiSeq) sequencing in the target regions. Our targeted sequencing shows >300-fold enrichment and an on-target percentage of >30%, resulting in >20x average coverage on both platforms.
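The enrichment, on-target, and coverage figures can be related with some simple arithmetic. The sketch below assumes that fold enrichment is the on-target fraction divided by the fraction of the genome covered by the target, and it uses an approximate human genome size of 3.1 Gb; neither the formula nor the genome size is stated on the poster, so both are assumptions. The 2.5 Mb target size and the ~2 Gb yield come from the text.

```python
GENOME_SIZE_BP = 3.1e9  # assumed approximate human genome size
TARGET_SIZE_BP = 2.5e6  # capture region size given in the Methods


def fold_enrichment(on_target_fraction, target_bp=TARGET_SIZE_BP,
                    genome_bp=GENOME_SIZE_BP):
    """On-target fraction relative to the fraction expected without capture."""
    return on_target_fraction / (target_bp / genome_bp)


def mean_target_coverage(total_yield_bp, on_target_fraction,
                         target_bp=TARGET_SIZE_BP):
    """Average depth over the target for a given total sequencing yield."""
    return total_yield_bp * on_target_fraction / target_bp


def whole_genome_coverage(total_yield_bp, genome_bp=GENOME_SIZE_BP):
    """Average depth if the same yield were spread over the whole genome."""
    return total_yield_bp / genome_bp


print(f"fold enrichment at 30% on-target: {fold_enrichment(0.30):.0f}")
print(f"whole-genome depth from 2 Gb: {whole_genome_coverage(2e9):.2f}x")
print(f"target depth from 170 Mb at 30% on-target: "
      f"{mean_target_coverage(1.7e8, 0.30):.1f}x")
```

Under these assumptions a 30% on-target fraction corresponds to roughly 370-fold enrichment, which is consistent with the >300-fold figure, while 2 Gb spread over a whole genome gives well under 1x depth.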

Structural variations detected using the sniffles algorithm on the PDAC nanopore sequence reads. SVs detected in the PDAC DNA included the previously discovered SV in the CDKN2A gene [2] and a novel putative SV in the SMAD4 gene.

SNV detection at read-level resolution using the nanopolish algorithm on the NA12878 nanopore sequence reads. Nanopolish reduces false-positive SNV detection (numbers in parentheses indicate correct;incorrect) in comparison with the annotated NA12878 SNVs [3] and enables phased SNV analysis, as shown by the distinction between homozygous and heterozygous SNVs.
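As a rough illustration of how mismatch frequency separates homozygous from heterozygous SNVs, and of how a correct;incorrect count can be tallied against a reference call set, consider the sketch below. The thresholds (0.25 and 0.75), the function names, and the toy positions are assumptions for the example; the poster does not state the cutoffs or the scoring code actually used.

```python
def classify_genotype(mismatch_frequency, het_min=0.25, hom_min=0.75):
    """Classify a locus from the fraction of reads that mismatch the reference.

    Frequencies near 1 suggest a homozygous SNV, near 0.5 a heterozygous one.
    The cutoffs are illustrative assumptions, not values from the poster.
    """
    if not 0.0 <= mismatch_frequency <= 1.0:
        raise ValueError("mismatch_frequency must be between 0 and 1")
    if mismatch_frequency >= hom_min:
        return "hom"
    if mismatch_frequency >= het_min:
        return "het"
    return "ref"


def tally_calls(called_positions, truth_positions):
    """Return (correct, incorrect) counts of called SNVs against a truth set."""
    called = set(called_positions)
    truth = set(truth_positions)
    return len(called & truth), len(called - truth)


# Toy data with made-up positions and frequencies, for illustration only.
toy_frequencies = {1000: 0.98, 1200: 0.47, 1500: 0.05}
toy_calls = {pos: classify_genotype(f) for pos, f in toy_frequencies.items()}
print(toy_calls)
print(tally_calls([1000, 1200, 1800], [1000, 1200, 1300]))
```

In this notation, a result such as (6;2) means six called SNVs matched the reference set and two did not.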

[Figure: An example subset of NA12878 SNVs. Track labels: NA12878 reference SNVs (Hom., Het.); read-level detected SNVs; mismatch frequency (0 to 1). x-axis: chr9 coordinates, 21,008,000 to 21,009,000. Counts (correct;incorrect): True SNVs (6;0); Post-Nanopolish (6;2); Pre-Nanopolish (4;3); Illumina (6;0).]

[Figure: PDAC SMAD4 Structural Variation. x-axis: chr18 coordinates (kbp), 51,198 to 51,202. Legend: PDAC; capture probe; structural variation.]

[Figure: PDAC CDKN2A Structural Variation. x-axis: chr9 coordinates (kbp), 21,800 to 22,400. y-axis: coverage, 0 to 40. Legend: PDAC; NA12878; structural variation.]

[Figure: SMAD4 Capture Region. x-axis: chr18 coordinates (kbp). y-axes: nanopore coverage and Illumina coverage. Legend: Illumina; nanopore; capture region.]

[Figure: CDKN2A Capture Region. x-axis: chr9 coordinates (kbp), 21,000 to 22,500. y-axes: nanopore coverage and Illumina coverage.]