Best practices for benchmarking variant calls
Justin Zook and the GA4GH Benchmarking Team
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
Genome in a Bottle Consortium
November 14, 2017
Take-home Messages
• Benchmarking variant calls is easy to do incorrectly
• The GA4GH Benchmarking Team has developed a set of public tools for robust, standardized benchmarking of variant calls
• Benchmarking results should be interpreted critically
• Ongoing work on difficult variants and regions
Why are we doing this work?
• Technologies evolving rapidly
• Different sequencing and bioinformatics methods give different results
• Now have concordance in easy regions, but not in difficult regions
• Challenge: How do we benchmark variants in a 6 billion base-pair genome?
O'Rawe et al., Genome Medicine, 2013. https://doi.org/10.1186/gm432
Genome in a Bottle Consortium: Authoritative Characterization of Human Genomes
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials to evaluate performance
• established consortium to develop reference materials, data, methods, performance metrics
(generic measurement process)
www.slideshare.net/genomeinabottle
Bringing Principles of Metrology to the Genome
• Reference materials
– DNA in a tube you can buy from NIST
• Extensive state-of-the-art characterization
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology
Benchmarking the GIAB benchmarks
• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
Manual curation is required
Evolution of high-confidence calls

| Version | HC regions | HC calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only | Variants phased |
|---------|------------|----------|-----------|--------------------|-------------------|-----------------|---------|-----------------|
| v2.19   | 2.22 Gb    | 3,153,247 | 352,937  | 3,030,703          | 87                | 404             | 1,018,795 | 0.3%  |
| v3.2.2  | 2.53 Gb    | 3,512,990 | 335,594  | 3,391,783          | 57                | 52              | 657,715   | 3.9%  |
| v3.3    | 2.57 Gb    | 3,566,076 | 358,753  | 3,441,361          | 40                | 60              | 608,137   | 8.8%  |
| v3.3.2  | 2.58 Gb    | 3,691,156 | 487,841  | 3,529,641          | 47                | 61              | 469,202   | 99.6% |
5-7 errors in NIST
1-7 errors in NIST
~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files
Global Alliance for Genomics and Health Benchmarking Task Team
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Developing sophisticated benchmarking tools
• Integrated into a single framework with standardized inputs and outputs
• Standardized bed files with difficult genome contexts for stratification
https://github.com/ga4gh/benchmarking-tools
Variant types can change when decomposing or recomposing variants:
Complex variant: chr1 201586350 CTCTCTCTCT CA
DEL + SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A
Credit: Peter Krusche, Illumina; GA4GH Benchmarking Team
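Whether two such representations describe the same change can be checked by replaying each onto the same reference fragment and comparing the resulting haplotype sequences, which is essentially what engines like vcfeval do. Below is a minimal sketch; the fragment, coordinates, and the exact non-overlapping decomposition are illustrative, not the slide's literal records:

```python
def apply_variants(ref, variants):
    """Apply (pos, ref_allele, alt_allele) records (1-based on this
    fragment) to a reference string and return the alternate haplotype.
    Variants are applied right-to-left so earlier coordinates stay valid."""
    seq = ref
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        assert seq[start:start + len(ref_allele)] == ref_allele, "REF mismatch"
        seq = seq[:start] + alt_allele + seq[start + len(ref_allele):]
    return seq

# Illustrative fragment containing the CT repeat from the slide.
ref = "GG" + "CTCTCTCTCT" + "GG"

complex_rep = [(3, "CTCTCTCTCT", "CA")]   # one complex record
decomposed  = [(3, "CTCTCTCTC", "C"),     # deletion within the repeat
               (12, "T", "A")]            # SNP at the repeat's last base

hap1 = apply_variants(ref, complex_rep)
hap2 = apply_variants(ref, decomposed)
print(hap1, hap2, hap1 == hap2)  # both haplotypes are GGCAGG
```

Position-by-position comparison of the two VCFs would call these discordant; haplotype replay shows they are the same sequence change.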
Why are definitions important?
Challenges
• Genotype comparisons don’t naturally fall into 2 categories as required for sensitivity, precision, and specificity
• Sometimes variants are partially called and/or partially filtered
• Clustered variants can be counted individually or as a single complex event
• How should filtered variants or “no-call” sites be treated?
Example cases
• Truth is a heterozygous SNP but vcf has a homozygous SNP: 1 FP, 1 FN, and 1 genotype mismatch
• Truth is an indel but vcf has a SNP at the same position: 1 FP, 1 FN, and 1 allele mismatch
• Truth is a deletion + SNP but vcf has the deletion only: 1 TP and 1 FN, or 1 FP and 1-2 FNs, depending on representations and comparison method
Why are sophisticated comparison tools needed? Normalization isn't sufficient
Comparison methods affect performance metrics
• Some callers are affected by the comparison method more than others
– Biggest effect from clustering nearby variants
GA4GH Reference Implementation
Truth VCF
Query VCF
Comparison Engine: vcfeval / vgraph / xcmp / bcftools / ...
VCF-I
Quantification
quantify / hap.py
Stratification BED files
Confident Call Regions
VCF-R
Counts / ROCs
HTML Report e.g. for precisionFDA
Workflow output
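The quantification step at the end of this pipeline reduces the labeled comparison output to the standardized summary metrics. A minimal sketch of that arithmetic (function and key names are mine, not hap.py's):

```python
def quantify(tp, fp, fn):
    """Compute GA4GH-style summary metrics from match counts.
    recall    = TP / (TP + FN)   (a.k.a. sensitivity)
    precision = TP / (TP + FP)
    f1        = harmonic mean of recall and precision"""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

print(quantify(tp=98, fp=2, fn=4))  # toy counts, not real benchmark output
```

The real quantify step additionally emits per-stratum counts and ROC curves by sweeping a quality threshold over these same counts.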
Benchmarking example: NA12878 / GiaB / 50X / PCR-Free / Hiseq2000
https://illumina.box.com/s/vjget1dumwmy0re19usetli2teucjel1
Credit: Peter Krusche, Illumina; GA4GH Benchmarking Team
Benchmarking Tools: Standardized comparison, counting, and stratification with hap.py + vcfeval
https://precision.fda.gov/
https://github.com/ga4gh/benchmarking-tools
FN rates high in some tandem repeats
[Figure: FN rate vs. average, stratified by tandem repeat unit size (2 bp, 3 bp, 4 bp) and total repeat length (1 to 50 bp; 51 to 200 bp), with rates shown on a 0.3x-30x scale relative to average]
Benchmarking stats can be difficult to interpret. Example: FN SNPs in coding regions
RefSeq Coding Regions
• Studies often focus on variants in coding regions
• We look at FN SNP rates for bwa-GATK using the decoy
SNP benchmarking stats vs. PG and 3.3.2
• 97.98% sensitivity vs. PG
– FNs predominantly in low-MQ and/or segmental duplication regions
– ~80% of FNs supported by long or linked reads
• 99.96% sensitivity vs. NIST v3.3.2
– 62x lower FN rate than vs. PG
• As always, true sensitivity is unknown
True accuracy is hard to estimate, especially in difficult regions
Benchmarking against each GIAB genome
Genome | Type | Subset | 100% - recall (%) | 100% - precision (%) | Recall | Precision | Fraction of calls outside high-conf bed
HG001 SNP all 0.0277 0.1274 0.9997 0.9987 0.1653
HG002 SNP all 0.0664 0.1342 0.9993 0.9987 0.1910
HG003 SNP all 0.0625 0.1489 0.9994 0.9985 0.1967
HG004 SNP all 0.0633 0.1592 0.9994 0.9984 0.1975
HG005 SNP all 0.1175 0.0870 0.9988 0.9991 0.1834
HG001 SNP notinalldifficultregions 0.0096 0.0783 0.9999 0.9992 0.0491
HG002 SNP notinalldifficultregions 0.0102 0.0576 0.9999 0.9994 0.0864
HG003 SNP notinalldifficultregions 0.0128 0.0819 0.9999 0.9992 0.0864
HG004 SNP notinalldifficultregions 0.0102 0.0860 0.9999 0.9991 0.0854
HG005 SNP notinalldifficultregions 0.0931 0.0541 0.9991 0.9995 0.0664
HG001 INDEL all 0.8354 0.7458 0.9916 0.9925 0.4485
HG002 INDEL all 0.8271 0.7016 0.9917 0.9930 0.4547
HG003 INDEL all 0.7546 0.6523 0.9925 0.9935 0.4632
HG004 INDEL all 0.7345 0.6390 0.9927 0.9936 0.4592
HG005 INDEL all 0.9840 0.7418 0.9902 0.9926 0.4850
HG001 INDEL notinalldifficultregions 0.0551 0.1475 0.9994 0.9985 0.1927
HG002 INDEL notinalldifficultregions 0.0497 0.0893 0.9995 0.9991 0.2208
HG003 INDEL notinalldifficultregions 0.0508 0.1627 0.9995 0.9984 0.2229
HG004 INDEL notinalldifficultregions 0.0496 0.1307 0.9995 0.9987 0.2190
HG005 INDEL notinalldifficultregions 0.1182 0.1535 0.9988 0.9985 0.2049
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Can you predict your performance for clinical variants of interest based on sequencing reference samples?
Best Practices for Benchmarking
Benchmark sets: Use benchmark sets with both high-confidence variant calls and high-confidence regions, so that both false negatives and false positives can be assessed.
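Concretely, a query call can only be declared a false positive where the benchmark asserts the absence of other variants, i.e. inside the high-confidence regions; calls outside them should be excluded rather than counted as errors. A toy sketch of that bookkeeping (positions stand in for full variant records; all names are mine):

```python
def in_regions(pos, regions):
    """True if pos falls in any (start, end) half-open interval."""
    return any(start <= pos < end for start, end in regions)

def classify(truth, query, confident):
    """Compare truth and query call positions inside confident regions.
    Query calls outside the confident regions are set aside, not
    counted as FPs, because the benchmark makes no claim there."""
    truth_in = {p for p in truth if in_regions(p, confident)}
    query_in = {p for p in query if in_regions(p, confident)}
    return {"TP": sorted(truth_in & query_in),
            "FN": sorted(truth_in - query_in),
            "FP": sorted(query_in - truth_in),
            "outside": sorted(set(query) - query_in)}

res = classify(truth=[100, 250, 900], query=[100, 300, 900, 5000],
               confident=[(0, 1000)])
print(res)  # the call at 5000 is "outside", not a false positive
```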
Stringency of variant comparison
Determine whether it is important that the genotype match exactly, only the allele matches, or the call just needs to be near the true variant.
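These three stringency levels can be made precise with a small matcher. In this sketch a variant record is just (pos, alt_allele, genotype), and the level names are illustrative:

```python
def match(truth, query, stringency="genotype"):
    """Match a truth and a query record at a chosen stringency.
    'genotype': position, alternate allele, and genotype must all agree.
    'allele'  : position and alternate allele must agree.
    'position': a call at the truth position is enough."""
    t_pos, t_alt, t_gt = truth
    q_pos, q_alt, q_gt = query
    if stringency == "genotype":
        return (t_pos, t_alt, t_gt) == (q_pos, q_alt, q_gt)
    if stringency == "allele":
        return (t_pos, t_alt) == (q_pos, q_alt)
    if stringency == "position":
        return t_pos == q_pos
    raise ValueError(stringency)

truth = (201586359, "A", "het")
query = (201586359, "A", "hom")
# Genotype-level: mismatch; allele- and position-level: match.
```

Which level is appropriate depends on the application, e.g. whether a genotyping error is as consequential as a missed variant.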
Variant comparison tools
Use sophisticated variant comparison engines such as vcfeval, xcmp, or varmatch that are able to determine if different representations of the same variant are consistent with the benchmark call. Subsetting by high-confidence regions and, if desired, targeted regions, should only be done after comparison to avoid problems comparing variants with differing representations.
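One ingredient these engines handle is representation normalization, e.g. trimming shared bases so that `CTT→CT` and `TCTT→TCT` compare as the same deletion. A minimal sketch of the trimming step (real engines additionally left-align against the reference and replay haplotypes, which simple trimming cannot replace):

```python
def trim(pos, ref, alt):
    """Trim shared suffix bases, then shared prefix bases, from a
    REF/ALT pair, advancing pos for each trimmed prefix base. This is
    a common normalization step for indel representations."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# Two spellings of the same 1 bp deletion reduce to one record:
print(trim(100, "CTT", "CT"))    # -> (100, 'CT', 'C')
print(trim(99, "TCTT", "TCT"))   # -> (100, 'CT', 'C')
```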
Manual curation: Manually curate alignments, ideally from multiple data types, around at least a subset of putative false positive and false negative calls in order to ensure they are truly errors in the user's callset and to understand the cause(s) of errors. Report back to benchmark set developers any potential errors found in the benchmark set (e.g., using https://goo.gl/forms/ECbjHY7nhz0hrCR52 for GIAB).
Interpretation of metrics
All performance metrics should only be interpreted with respect to the limitations of the variants and regions in the benchmark set. Performance metrics are likely to be lower for more difficult variant types and regions that are not fully represented in the benchmark set, such as those in repetitive or difficult-to-map regions. When comparing methods, method 1 may perform better in the high-confidence regions, but method 2 may perform better for more difficult variants outside the high-confidence regions.
Stratification Overall performance metrics can be useful, but for many applications it is important to assess performance for particular variant types and genome contexts. Performance often varies significantly across variant types and genome contexts, and stratification allows users to understand this. In addition, stratification allows users to see if some variant types and genome contexts of interest are not sufficiently represented.
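Mechanically, stratification is just partitioning calls or errors by genome-context BED files before recomputing rates; a toy sketch (the stratum names here are illustrative; the GA4GH team publishes real stratification BEDs):

```python
def stratify(errors, strata):
    """Count error positions per stratum, where strata maps a context
    name to a list of (start, end) half-open BED-style intervals.
    Positions in no stratum are counted under 'other'."""
    counts = {name: 0 for name in strata}
    counts["other"] = 0
    for pos in errors:
        for name, regions in strata.items():
            if any(start <= pos < end for start, end in regions):
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts

fn_positions = [120, 450, 460, 900]
strata = {"tandem_repeat": [(100, 200), (440, 470)],
          "segdup": [(800, 850)]}
print(stratify(fn_positions, strata))
```

A stratum with zero truth variants is itself informative: it means the benchmark cannot assess performance in that context at all.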
Confidence Intervals
Confidence intervals for performance metrics such as precision and recall should be calculated. This is particularly critical for the smaller numbers of variants found when benchmarking in targeted regions and/or less common stratified variant types and regions.
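As one way to do this, a binomial proportion such as precision = TP / (TP + FP) can be given a Wilson score interval, which behaves better than the normal approximation when the proportion is near 1, as it is here. A sketch with z = 1.96 for a ~95% interval:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion,
    e.g. precision from TP successes out of TP + FP calls."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 99 TPs out of 100 calls: the point estimate is 0.99, but with only
# 100 calls (as in a small targeted panel) the interval is wide.
print(wilson_ci(99, 100))
print(wilson_ci(990, 1000))  # ten times the calls, a much tighter interval
```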
Ongoing and Future Work
• Characterizing difficult variants and regions
– Large indels and structural variants
– Tandem repeats and homopolymers
– Difficult-to-map regions
– Complex variants
• New germline samples
– Additional ancestries
• Tumor/normal cell lines
– Developing IRB protocol for broadly-consented samples
Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Lesley Chapman
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
Public workshops – Next workshop Jan 25-26, 2018 at Stanford University, CA, USA
NIST/JIMB postdoc opportunities available!
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov