GIAB Sep2016 Lightning: Chen Sun, VarMatch
VarMatch: robust matching of small variant datasets using flexible scoring schemes
Chen Sun, Paul Medvedev
Penn State
Variant Matching
• Different pipelines tend to report variants in different representations
• Need to compare VCF files
  • Evaluate variant callers
  • Find the overlap as high-confidence variants
  • Add variants to a database
• Two variant sets are equivalent if applying them separately to the reference genome results in the same donor genome.
• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.
The Variant Matching problem
Seq: A G C C G G
1: REF GCCG, ALT CCGA
2: REF G, C, G; ALT C, G, A
3: REF AG, G; ALT A, GA
Donor: A C C G A G
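The three representations above can be checked against the equivalence definition with a short Python sketch. The 1-based positions used below are not stated on the slide; they are inferred here so that each set reproduces the donor. The helper names `apply_variants` and `equivalent` are hypothetical and are not part of VarMatch.

```python
from typing import List, Tuple

# A variant is (1-based position, REF allele, ALT allele), as in a VCF record.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to the reference and return the donor."""
    donor = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        donor.append(reference[cursor:start])  # unchanged bases before the variant
        donor.append(alt)                      # the substituted allele
        cursor = start + len(ref)
    donor.append(reference[cursor:])
    return "".join(donor)


def equivalent(reference: str, a: List[Variant], b: List[Variant]) -> bool:
    """Two variant sets are equivalent if they yield the same donor sequence."""
    return apply_variants(reference, a) == apply_variants(reference, b)


REFERENCE = "AGCCGG"
# Positions below are assumptions inferred from the donor sequence.
SET_1 = [(2, "GCCG", "CCGA")]
SET_2 = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]
SET_3 = [(1, "AG", "A"), (5, "G", "GA")]

for name, calls in (("1", SET_1), ("2", SET_2), ("3", SET_3)):
    print(name, apply_variants(REFERENCE, calls))
```

Under these assumed positions, all three sets produce the donor `ACCGAG`, so any two of them are equivalent in the sense defined earlier.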
• Naïve approach
  • Match two variants only if their location and alleles are exactly the same
• Normalization (Tan et al 15)
  • Guarantees to match equivalent singletons
• Complex Variants
  • One variant matches multiple variants
  • Multiple variants match multiple variants
• Decomposition (Li 14, Zook et al 14)
  • Creates fractional matches
  • Does not always work (Example )
VarMatch Algorithm Overview
• Separator on reference genome sequence
  • Variants on the left cannot be equivalent to variants on the right
• Linear scan of reference genome to identify separators
  • Solve independent small problems
  • Branch and bound method for each small problem
• Similar algorithm to Cleary et al., 2015
  • Problem size is small
  • Requires less memory and time
• Theorem for identifying separators
Software: https://github.com/medvedevgroup/varmatch
Preprint: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)
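To illustrate what one of the small independent problems looks like, here is a hedged Python sketch that solves a single cluster by exhaustive enumeration. It is a stand-in for illustration only: VarMatch uses branch and bound, and the separator theorem is not reproduced here. The names `best_donors` and `match_cluster` are hypothetical, and the score is simply the total number of matched variants.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF, ALT)


def apply_variants(reference: str, variants: Sequence[Variant]) -> Optional[str]:
    """Return the donor sequence, or None if the variants overlap."""
    donor, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            return None
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        donor += [reference[cursor:start], alt]
        cursor = start + len(ref)
    donor.append(reference[cursor:])
    return "".join(donor)


def best_donors(reference: str, calls: Sequence[Variant]) -> Dict[str, Tuple[Variant, ...]]:
    """Map each reachable donor to the largest non-empty subset producing it."""
    best: Dict[str, Tuple[Variant, ...]] = {}
    for size in range(1, len(calls) + 1):
        for subset in combinations(calls, size):
            donor = apply_variants(reference, subset)
            if donor is None:
                continue
            if donor not in best or len(subset) > len(best[donor]):
                best[donor] = subset
    return best


def match_cluster(reference: str, baseline: Sequence[Variant], query: Sequence[Variant]):
    """Pick the pair of equivalent subsets with the most variants in total."""
    base = best_donors(reference, baseline)
    qry = best_donors(reference, query)
    shared = sorted(set(base) & set(qry))  # sorted for a deterministic tie-break
    if not shared:
        return (), ()
    donor = max(shared, key=lambda d: len(base[d]) + len(qry[d]))
    return base[donor], qry[donor]


reference = "AGCCGG"
baseline = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]
query = [(1, "AG", "A"), (5, "G", "GA"), (6, "G", "T")]
print(match_cluster(reference, baseline, query))
```

The enumeration is exponential in the number of variants per cluster, so it is only workable because separators keep each cluster small; a branch and bound search prunes most of these subsets instead of visiting them.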
VarMatch supports flexible scoring schemes
• Maximize the total number of matched variants, or only those in the baseline?
• Maximize the number of calls or the total edit distance?
  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base.
• Require genotypes to match, or just detect that a variant is present?
Others possible?
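The first two choices above can be pictured as parameters of a scoring function. The sketch below is an assumption-laden illustration, not VarMatch's interface: the `score` function, its `unit` and `side` parameters, and the use of Levenshtein distance between REF and ALT as the "edit distance" are all choices made here for the example. Genotype matching is left out.

```python
from typing import Sequence, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF, ALT)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two alleles (assumed meaning of edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def score(baseline_matched: Sequence[Variant],
          query_matched: Sequence[Variant],
          unit: str = "variants",
          side: str = "total") -> int:
    """Score a pair of matched subsets under a chosen scheme.

    unit: "variants" counts calls, "edit_distance" sums per-call edit distance.
    side: "total" counts both sets, "baseline" counts the baseline only.
    """
    if unit not in ("variants", "edit_distance"):
        raise ValueError("unknown unit")
    if side not in ("total", "baseline"):
        raise ValueError("unknown side")

    def weight(v: Variant) -> int:
        return 1 if unit == "variants" else edit_distance(v[1], v[2])

    total = sum(weight(v) for v in baseline_matched)
    if side == "total":
        total += sum(weight(v) for v in query_matched)
    return total


one_big_call = [(1, "A" * 10, "C" * 10)]
ten_small_calls = [(i, "A", "C") for i in range(1, 11)]
print(score(one_big_call, [], unit="variants", side="baseline"))
print(score(ten_small_calls, [], unit="variants", side="baseline"))
print(score(one_big_call, [], unit="edit_distance", side="baseline"))
```

Counting calls favors the ten small calls over the one large call, while the edit-distance unit treats them the same, which is the trade-off raised in the second bullet.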
Benchmark
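Number of Matched Variants

| Method | CHM1 + bowtie (Li 14), Freebayes | CHM1 + bowtie (Li 14), GATK-HC | NA12878 + bowtie (Li 14), Freebayes | NA12878 + bowtie (Li 14), GATK-UG |
| --- | --- | --- | --- | --- |
| Vt normalize | 2,778,372 | 2,778,372 | 4,092,161 | 4,092,161 |
| RTG Tools | 2,843,396 | 2,912,641 | 4,197,070 | 4,321,997 |
| VarMatch | 2,843,396 | 2,912,641 | 4,197,138 | 4,322,083 |

Memory and Running Time Evaluation

| Tool | RAM (Gb) | Time (s) |
| --- | --- | --- |
| RTG Tools | 48 | 456 |
| VarMatch | 5 | 302 |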
Matching in low-complexity regions
• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 14)
  • Using Bowtie2+GATK as baseline
• Focus on low-complexity regions
• 12% more equivalent variants identified using VarMatch than with normalization
[Figure: two panels, "Results of Vt-normalize" and "Results of VarMatch"]
Matching in dense regions
• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 14)
  • Using GIAB Gold Standard (Zook et al 14) as baseline
• Focus on “dense regions”
  • 10-base regions that contain an INDEL and another variant
• Genome-wide assessment differs from that in dense regions

Number of Matched Variants in Baseline

| Region | Freebayes | Platypus |
| --- | --- | --- |
| genome wide | 2,896,841 | 2,891,849 |
| dense regions | 24,188 | 24,522 |
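The dense-region definition above can be turned into a small filter. This is a sketch under stated assumptions, not the procedure used for the numbers on the slide: variants are compared by start position only, two variants share a 10-base region when their positions differ by fewer than 10, and the function name `dense_variants` is hypothetical.

```python
from typing import List, Sequence, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF, ALT) on one chromosome


def is_indel(variant: Variant) -> bool:
    """An indel changes the allele length."""
    return len(variant[1]) != len(variant[2])


def dense_variants(variants: Sequence[Variant], window: int = 10) -> List[Variant]:
    """Return variants lying in a window that holds an indel plus another variant."""
    ordered = sorted(variants)
    dense = set()
    for i, indel in enumerate(ordered):
        if not is_indel(indel):
            continue
        # Quadratic scan kept for clarity; a real tool would use a sliding window.
        neighbours = [other for j, other in enumerate(ordered)
                      if j != i and abs(other[0] - indel[0]) < window]
        if neighbours:
            dense.add(indel)
            dense.update(neighbours)
    return sorted(dense)


calls = [
    (100, "A", "AT"),   # insertion with a SNP 5 bases away
    (105, "G", "C"),
    (300, "C", "T"),    # isolated SNP
    (500, "AG", "A"),   # deletion with a SNP 9 bases away
    (509, "T", "G"),
    (510, "C", "A"),    # 10 bases from the deletion, outside the window
    (900, "GTT", "G"),  # isolated deletion
]
print(dense_variants(calls))
```

Restricting an evaluation to the variants this filter returns is one way to reproduce the kind of dense-region comparison shown in the table, where the ranking of callers can differ from the genome-wide ranking.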
Conclusion
• Software: https://github.com/medvedevgroup/varmatch
• Manuscript: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)
Supplementary
VarMatch Highlights
• Uses less memory and running time
• Better performance matching complex variants
• Better performance in low-complexity regions
• Better performance in dense regions
• Flexible scoring schemes