giab sep2016 lightning chen sun varmatch
VarMatch: robust matching of small variant datasets using flexible scoring schemes
Chen Sun, Paul Medvedev
Penn State
Variant Matching
• Different pipelines tend to report the same variants in different representations
• Need to compare VCF files
  • Evaluate variant callers
  • Find the overlap as high-confidence variants
  • Add variants to a database
• Two variant sets are equivalent if applying them separately to the reference genome results in the same donor genome.
• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.
The Variant Matching problem
Seq A G C C G G
1 REF G C C G
ALT C C G A
2 REF G C G
ALT C G A
3 REF A G G
ALT A G A
Donor: A C C G A G
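The equivalence definition can be checked mechanically by applying each call set to the reference and comparing the resulting donor sequences. The sketch below does this for the three representations in the example. The transcript lost the column alignment of the figure, so the positions and the split of each row into individual variants are an assumed reading, and `apply_variants` is an illustrative helper, not part of VarMatch.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (0-based position, REF allele, ALT allele)


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to the reference and return the donor."""
    parts = []
    cursor = 0  # first reference base not yet copied or replaced
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        parts.append(reference[cursor:pos])  # unchanged stretch
        parts.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    parts.append(reference[cursor:])
    return "".join(parts)


REFERENCE = "AGCCGG"
# Assumed reading of the flattened figure (positions are guesses):
SET_1 = [(1, "GCCG", "CCGA")]                          # one complex variant
SET_2 = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A")]  # three SNPs
SET_3 = [(0, "AG", "A"), (4, "G", "GA")]               # deletion + insertion

donors = {apply_variants(REFERENCE, s) for s in (SET_1, SET_2, SET_3)}
print(donors)
```

Under this reading all three sets yield the donor ACCGAG, so they are pairwise equivalent even though no two of them share a single identical record.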
• Naïve approach
  • Match two variants only if their locations and alleles are exactly the same
• Normalization (Tan et al 15)
  • Guarantees that equivalent singletons are matched
• Complex variants
  • One variant matches multiple variants
  • Multiple variants match multiple variants
• Decomposition (Li 14, Zook et al 14)
  • Creates fractional matches
  • Does not always work (Example)
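To illustrate why normalization handles single variants, here is a minimal sketch of the left-align-and-trim idea for one biallelic variant: trim shared trailing bases, extend to the left when an allele becomes empty, then trim shared leading bases. The function name `normalize` and the example sequence are assumptions for illustration; this is not the vt implementation and it ignores multi-allelic records.

```python
from typing import Tuple


def normalize(reference: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-align and trim one biallelic variant (0-based pos, ref != alt)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one reference base from the left.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend left of the reference start")
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical repeat region: deleting any one "CA" unit gives the same donor.
REPEAT_REFERENCE = "GCACACAT"
right_shifted = normalize(REPEAT_REFERENCE, 4, "ACA", "A")
left_shifted = normalize(REPEAT_REFERENCE, 0, "GCA", "G")
print(right_shifted, left_shifted)
```

Both inputs end up as the same record, which is why an exact comparison after normalization matches them. A complex variant written as one record in one call set and as several records in the other is not reduced to a common form this way.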
VarMatch Algorithm Overview
• Separator on the reference genome sequence
  • Variants on the left cannot be equivalent to variants on the right
• Linear scan of the reference genome to identify separators
• Solve each small problem independently
  • Branch and bound method for each small problem
• Similar algorithm to Cleary et al., 2015
  • Problem size is small
  • Requires less memory and time
• Theorem for identifying separators
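The divide-then-solve structure above can be sketched as follows. Two pieces are deliberately swapped for simpler stand-ins: clusters are cut wherever consecutive variants are at least `min_gap` reference bases apart (a hypothetical rule, not the separator theorem from the talk, and not guaranteed correct in repeats), and each cluster is solved by exhaustive subset enumeration rather than branch and bound. All function names are illustrative.

```python
from itertools import combinations
from typing import Iterator, List, Optional, Tuple

Variant = Tuple[int, str, str]  # (0-based position, REF, ALT)
Tagged = Tuple[Variant, str]    # variant plus its call set, "a" or "b"


def apply_variants(reference: str, variants) -> Optional[str]:
    """Return the donor sequence, or None if variants overlap or REF disagrees."""
    parts, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            return None
        parts += [reference[cursor:pos], alt]
        cursor = pos + len(ref)
    parts.append(reference[cursor:])
    return "".join(parts)


def split_into_clusters(calls_a, calls_b, min_gap: int) -> List[List[Tagged]]:
    """Linear scan; start a new cluster after a gap of >= min_gap bases (assumed rule)."""
    tagged = sorted([(v, "a") for v in calls_a] + [(v, "b") for v in calls_b])
    clusters, current, end = [], [], 0
    for variant, tag in tagged:
        pos, ref, _ = variant
        if current and pos - end >= min_gap:
            clusters.append(current)
            current = []
        end = pos + len(ref) if not current else max(end, pos + len(ref))
        current.append((variant, tag))
    if current:
        clusters.append(current)
    return clusters


def all_subsets(items) -> Iterator[tuple]:
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def best_match(reference: str, cluster: List[Tagged]):
    """Exhaustively find the largest pair of subsets that give the same donor."""
    side_a = [v for v, tag in cluster if tag == "a"]
    side_b = [v for v, tag in cluster if tag == "b"]
    largest_b = {}  # donor sequence -> largest subset of side_b producing it
    for subset in all_subsets(side_b):
        donor = apply_variants(reference, subset)
        if donor is not None and len(subset) >= len(largest_b.get(donor, ())):
            largest_b[donor] = subset
    best = ((), ())
    for subset in all_subsets(side_a):
        donor = apply_variants(reference, subset)
        if donor in largest_b:
            candidate = (subset, largest_b[donor])
            if len(candidate[0]) + len(candidate[1]) > len(best[0]) + len(best[1]):
                best = candidate
    return best


# Hypothetical data: one equivalent cluster and one non-matching cluster.
REFERENCE = "AGCCGG" + "T" * 10 + "ACGT"
CALLS_A = [(1, "GCCG", "CCGA"), (17, "C", "T")]
CALLS_B = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A"), (18, "G", "A")]

clusters = split_into_clusters(CALLS_A, CALLS_B, min_gap=5)
matches = [best_match(REFERENCE, c) for c in clusters]
matched_total = sum(len(a) + len(b) for a, b in matches)
print(len(clusters), matched_total)
```

Exhaustive enumeration is exponential in cluster size, so it is only reasonable here because separators keep each cluster small; a branch and bound search prunes the same space instead of visiting all of it.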
Software: https://github.com/medvedevgroup/varmatch
Preprint: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)
VarMatch supports flexible scoring schemes
• Maximize the total number of matched variants, or just those in the baseline?
• Maximize the number of calls or the total edit distance?
  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base.
• Require genotypes to match, or just detect that a variant is present?
Others possible?
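The first two choices can be expressed as options on a single scoring function. The sketch below is an illustration only: the parameter names are hypothetical, the edit distance of a call is taken as the Levenshtein distance between its REF and ALT alleles (an assumption about what "total edit distance" means), and genotype matching is left out.

```python
from typing import Iterable, Tuple

Variant = Tuple[int, str, str]  # (0-based position, REF, ALT)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b))) # substitution
        prev = cur
    return prev[-1]


def score(baseline_matched: Iterable[Variant],
          query_matched: Iterable[Variant],
          count_query: bool = True,
          by_edit_distance: bool = False) -> int:
    """Score a match under one of four hypothetical schemes."""
    def weight(variant: Variant) -> int:
        return edit_distance(variant[1], variant[2]) if by_edit_distance else 1

    total = sum(weight(v) for v in baseline_matched)
    if count_query:
        total += sum(weight(v) for v in query_matched)
    return total


# Hypothetical match: one complex baseline call vs. three query SNPs.
baseline = [(1, "GCCG", "CCGA")]
query = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A")]

print(score(baseline, query))                          # calls, both sets
print(score(baseline, query, count_query=False))       # calls, baseline only
print(score(baseline, query, by_edit_distance=True))   # edit distance, both sets
```

The same match scores differently under each scheme, so the subset that maximizes one scheme need not maximize another.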
Benchmark
Number of Matched Variants
              CHM1 + bowtie (Li 14)      NA12878 + bowtie (Li 14)
              Freebayes    GATK-HC       Freebayes    GATK-UG
Vt normalize  2,778,372    2,778,372     4,092,161    4,092,161
RTG Tools     2,843,396    2,912,641     4,197,070    4,321,997
VarMatch      2,843,396    2,912,641     4,197,138    4,322,083
Memory and Running Time Evaluation
              RAM (GB)     Time (s)
RTG Tools     48           456
VarMatch      5            302
Matching in low-complexity regions
• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 14)
  • Using Bowtie2+GATK as baseline
• Focus on low-complexity regions
• VarMatch identifies 12% more equivalent variants than normalization
[Figure panels: Results of Vt-normalize; Results of VarMatch]
Matching in dense regions
• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 14)
  • Using GIAB Gold Standard (Zook et al 14) as baseline
• Focus on “dense regions”
  • 10-base regions that contain an indel and another variant
• Genome-wide assessment differs from the assessment in dense regions
Number of Matched Variants in Baseline
               Freebayes    Platypus
genome wide    2,896,841    2,891,849
dense regions  24,188       24,522
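One way to select variants in "dense regions" is sketched below. The slide only says a dense region is a 10-base region containing an indel and another variant, so the exact rule used here (two variants whose start positions are fewer than `window` bases apart, at least one of them an indel) is an assumed interpretation, and the function names are illustrative.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (0-based position, REF, ALT)


def is_indel(variant: Variant) -> bool:
    """A call is an indel when REF and ALT differ in length."""
    return len(variant[1]) != len(variant[2])


def variants_in_dense_regions(variants: List[Variant], window: int = 10) -> List[Variant]:
    """Return variants that lie within `window` bases of another variant,
    where at least one of the two is an indel (assumed definition)."""
    ordered = sorted(variants)
    dense = set()
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] - first[0] >= window:
                break  # later variants start even further away
            if is_indel(first) or is_indel(second):
                dense.update((first, second))
    return sorted(dense)


# Hypothetical calls: an indel next to a SNP, two nearby SNPs, an isolated indel.
calls = [(5, "AG", "A"), (9, "G", "T"), (40, "C", "T"), (45, "A", "G"), (100, "T", "TA")]
dense_calls = variants_in_dense_regions(calls)
print(dense_calls)
```

Only the first two calls qualify: the two SNPs at 40 and 45 are close but involve no indel, and the insertion at 100 has no neighbor.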
Conclusion
• Software: https://github.com/medvedevgroup/varmatch
• Manuscript: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)
Supplementary
VarMatch Highlights
• Uses less memory and running time
• Better performance matching complex variants
• Better performance in low-complexity regions
• Better performance in dense regions
• Flexible scoring schemes