

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.

Detection and phasing of small variants in Genome in a Bottle samples with highly accurate long reads

William J. Rowell, Shreyasee Chakraborty, Primo Baybayan, Richard J. Hall

Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025

Introduction

- Long-read sequencing has been applied successfully to assemble genomes and detect structural variants.

- Long reads can be unambiguously mapped to more of the genome than short reads of comparable accuracy.

- However, it has been challenging to call small variants from long reads due to higher error rates for raw reads.

- PacBio HiFi reads that are >99% accurate and 10-20 kb in length enable detection of small and large variants, increasing discovery power for human genetics research.

- We sequenced the Genome in a Bottle (GIAB) [2] reference samples HG001, HG002, and HG005 to ~30-fold coverage to determine the optimal coverage depth for small variant detection and phasing.

Figure 1. Circular Consensus Sequencing (CCS). (a) A 10-20 kb linear template sequence is (b) ligated to SMRTbell adapters. (c) DNA polymerase synthesizes complementary sequences to both strands of the original linear template, leading to (d) rolling circle sequencing of the original template. (e) CCS analysis uses the noisy individual subreads to generate (f) a highly accurate consensus sequence (HiFi read). (g) With 8 passes, average accuracy is ~QV30 (99.9%).

(Figure 1 image not reproduced. Panels a-f: schematic of the CCS process ending in the HiFi read. Panel g: plot of CCS accuracy (Phred) against the number of subread passes.)
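The Figure 1 caption equates ~QV30 with 99.9% accuracy. That follows from the standard Phred definition, QV = -10 log10(1 - accuracy). A minimal sketch of the conversion is shown here; the function names are illustrative and are not taken from any PacBio tool.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    # The error probability is 1 - accuracy; Phred is -10 * log10(error).
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(f"QV30 -> accuracy {phred_to_accuracy(30):.4f}")
    print(f"accuracy 0.999 -> QV {accuracy_to_phred(0.999):.1f}")
```

Each additional 10 QV units corresponds to a tenfold drop in error rate, which is why small differences in percent accuracy near 100% matter for variant calling.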

Methods

Figure 2. Bioinformatics workflow for read mapping and variant detection. ~30-fold coverage (six SMRT Cells 8M with Sequel II System chemistry 1.0) of highly accurate (average 99.8%) 11 kb reads were mapped to the hg19 (hs37d5) reference with pbmm2. For each sample, aligned reads were randomly down-sampled 10 times for each 10% coverage depth increment. Single nucleotide variants (SNVs) and small indels (<50 bp) were detected using Google DeepVariant v0.8.0 and phased with WhatsHap v0.18. Variant calls were evaluated against GIAB v3.3.2 benchmarks.
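The down-sampling design described above (10 random replicates at each 10% depth increment) can be sketched as a simple plan of fraction and seed pairs. The poster does not say which tool performed the down-sampling; the sketch below assumes `samtools view -s`, whose argument uses the integer part as the random seed and the fractional part as the proportion of reads kept. The file names are hypothetical.

```python
from typing import List, Tuple


def titration_plan(step_percent: int = 10, replicates: int = 10) -> List[Tuple[int, int]]:
    """Return (percent, seed) pairs for every replicate at each depth below 100%."""
    plan = []
    for percent in range(step_percent, 100, step_percent):
        for seed in range(replicates):
            plan.append((percent, seed))
    return plan


def samtools_subsample_arg(percent: int, seed: int) -> str:
    """Format the value for `samtools view -s` as SEED.FRACTION (e.g. 3.10 keeps 10%)."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    return f"{seed}.{percent:02d}"


def subsample_command(in_bam: str, out_bam: str, percent: int, seed: int) -> List[str]:
    """Build (but do not run) a samtools command that keeps `percent` of the reads."""
    return ["samtools", "view", "-b",
            "-s", samtools_subsample_arg(percent, seed),
            "-o", out_bam, in_bam]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for percent, seed in titration_plan()[:3]:
        out_name = f"HG002.p{percent}.rep{seed}.bam"
        print(" ".join(subsample_command("HG002.hs37d5.bam", out_name, percent, seed)))
```

Using a different seed per replicate gives independent random subsets at the same depth, which is what makes a mean and standard deviation across replicates meaningful.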

Figure 2 workflow steps:

1. Align to reference genome with pbmm2

2. Randomly down-sample aligned reads

3. For each depth titration point, detect small variants with DeepVariant [3] and phase with WhatsHap [4]

4. For each depth titration point, compare to GIAB small variant benchmark using Hap.py [5]

Results

Table 1. Small variant benchmarks at ~15- and ~30-fold coverage. At ~30-fold coverage, DeepVariant achieved >99.8% precision and recall for SNVs, and >98.0% precision and recall for indels, while at ~15-fold coverage, DeepVariant achieved >98.7% precision and recall for SNVs, and >93.4% precision and recall for indels.

| Sample | Coverage Depth | SNV Recall | SNV Precision | Indel Recall | Indel Precision |
|--------|----------------|------------|---------------|--------------|-----------------|
| HG001  | 14-fold        | 98.71%     | 99.40%        | 93.46%       | 95.14%          |
| HG001  | 29-fold        | 99.89%     | 99.87%        | 98.09%       | 98.08%          |
| HG002  | 16-fold        | 99.16%     | 99.63%        | 95.21%       | 96.76%          |
| HG002  | 32-fold        | 99.95%     | 99.97%        | 98.86%       | 99.10%          |
| HG005  | 16-fold        | 99.12%     | 99.59%        | 96.31%       | 97.23%          |
| HG005  | 31-fold        | 99.92%     | 99.92%        | 99.12%       | 99.04%          |
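For readers less familiar with the benchmark metrics, the sketch below shows the standard definitions of precision and recall from true-positive, false-positive, and false-negative counts, plus the F1 score that combines them. The counts in the example are hypothetical; the second example reuses the HG002 32-fold SNV row of Table 1. F1 is a derived value and is not reported on the poster.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of called variants that match the benchmark."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of benchmark variants that were called."""
    return true_pos / (true_pos + false_neg)


def f1_score(prec: float, rec: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * prec * rec / (prec + rec)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = precision(true_pos=990, false_pos=10)
    r = recall(true_pos=990, false_neg=30)
    print(f"precision={p:.4f} recall={r:.4f} F1={f1_score(p, r):.4f}")

    # Precision and recall taken from Table 1 (HG002, 32-fold, SNVs).
    print(f"HG002 32-fold SNV F1={f1_score(0.9997, 0.9995):.4f}")
```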

Figure 3. Small variant benchmarks and phasing statistics over titration of coverage depth. (a) Precision and recall for SNVs and indels at different coverage titration levels. (b) Size of maximum phase block and phase block N50 at different coverage titration levels. All values are mean +/- standard deviation, n=10 for each sample at each coverage.
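Figure 3b reports the phase block N50, summarized as mean +/- standard deviation over the 10 down-sampling replicates. A minimal sketch of both calculations follows. N50 is the block length at which blocks of that size or larger contain at least half of the total phased length. The block lengths and replicate values in the example are made up, and the use of the sample (n-1) standard deviation is an assumption, since the poster does not specify it.

```python
import statistics
from typing import Iterable, Sequence, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of block lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first block where the cumulative length reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def mean_and_sd(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across replicates."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Made-up phase block lengths (bp), for illustration only.
    blocks = [20_000, 20_000, 20_000, 30_000, 30_000, 40_000, 80_000, 80_000]
    print("max block:", max(blocks), "N50:", n50(blocks))

    # Made-up N50 values from three replicates.
    print("mean, sd:", mean_and_sd([100_000, 110_000, 120_000]))
```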

Conclusions

- At 15-fold coverage, DeepVariant calls on PacBio HiFi reads achieve ~99% precision and recall for SNVs, increasing to ~99.9% precision and recall at 30-fold coverage.

- At 15-fold coverage, DeepVariant calls on PacBio HiFi reads achieve ~95% indel precision and recall, increasing to ~98% precision and recall at 30-fold coverage.

- We see very little increase in maximum phase block size or phase block N50 above 15-fold coverage.

- This coverage depth can be generated from 2-3 SMRT Cells 8M on the Sequel II System.

References and Data Availability

1. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).

2. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 3, 160025 (2016).

3. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36, 983–987 (2018). https://github.com/google/deepvariant

4. Patterson, M. et al. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. Journal of Computational Biology 22, 498–509 (2015). https://whatshap.readthedocs.io/en/latest/

5. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol 37, 555–560 (2019). https://github.com/Illumina/hap.py

PacBio HiFi reads generated on the Sequel II System using Chemistry 1.0 are deposited in SRA under BioProjects PRJNA527278, PRJNA540705, and PRJNA540706.

Alignments to hs37d5 are available from GIAB:

- ftp://ftp.ncbi.nlm.nih.gov//giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_SequelII_CCS_11kb

- ftp://ftp.ncbi.nlm.nih.gov//giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb

- ftp://ftp.ncbi.nlm.nih.gov//giab/ftp/data/ChineseTrio/HG005_NA24631_son/PacBio_SequelII_CCS_11kb

Contact: @nothingclever

Figure 2. Long, highly accurate reads can be mapped unambiguously through difficult regions. Mapped HiFi reads can easily be used to detect and phase small variants, even through repetitive sequence at the CYP2D6/CYP2D7 locus.

(Figure image not reproduced. Tracks across the CYP2D6 and CYP2D7 genes: HG002 2x250 bp reads, and HG002 11 kb HiFi reads separated into haplotype 1 and haplotype 2. Scale bar: 20 kb.)