giab sep2016 lightning chen sun varmatch

VarMatch: robust matching of small variant datasets using flexible scoring schemes

Chen Sun, Paul Medvedev

Penn State


Variant Matching

• Different pipelines tend to report the same variants in different representations

• Need to compare VCF files

  • Evaluate variant callers

  • Find the overlap as high-confidence variants

  • Add variants to a database

• Two variant sets are equivalent if applying them separately to the reference genome results in the same donor genome.

• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.


The Variant Matching problem

Seq     A G C C G G

Set 1   REF: GCCG        ALT: CCGA
Set 2   REF: G, C, G     ALT: C, G, A
Set 3   REF: AG, G       ALT: A, GA

Donor:  A C C G A G
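The equivalence of the three call sets in the example can be checked by applying each set to the reference and comparing the resulting donor sequences. The sketch below is only an illustration: the 1-based positions and the grouping of alleles into variants are inferred from the donor sequence, not stated on the slide, and `apply_variants` is a hypothetical helper, not part of VarMatch.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    pos is 1-based, as in VCF. Variants are applied from right to left so
    that the coordinates of the remaining variants stay valid after indels.
    """
    donor = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if donor[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match the reference at {pos}")
        donor = donor[:start] + alt + donor[start + len(ref):]
    return donor


reference = "AGCCGG"

# Positions below are assumptions inferred from the donor sequence.
set_1 = [(2, "GCCG", "CCGA")]                        # one complex variant
set_2 = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]  # three SNPs
set_3 = [(1, "AG", "A"), (5, "G", "GA")]             # a deletion and an insertion

donors = [apply_variants(reference, s) for s in (set_1, set_2, set_3)]
print(donors)
```

All three sets produce the donor `ACCGAG`, even though no two of them share a single variant with the same position and alleles.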

• Naïve approach

  • Match two variants if their location and alleles are exactly the same

• Normalization (Tan et al 15)

  • Guarantees to match equivalent singletons

• Complex variants

  • One variant matches multiple variants

  • Multiple variants match multiple variants

• Decomposition (Li 14, Zook et al 14)

  • Creates fractional matches

  • Does not always work (Example)
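For reference, normalization in the sense of Tan et al. (2015) trims shared bases from the right, extends to the left when an allele becomes empty, and finally trims shared bases from the left. The sketch below handles a single biallelic variant; the function name `normalize` and its signature are illustrative assumptions, not the interface of vt or VarMatch.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant (pos is 1-based)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; not a variant")

    # Step 1: trim a shared last base; if an allele becomes empty,
    # extend both alleles by one reference base to the left.
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
        elif (not ref or not alt) and pos > 1:
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
        else:
            break

    # Step 2: trim a shared first base while both alleles keep >= 1 base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Two representations of the same CA deletion in a CA repeat.
repeat_ref = "GGGCACACACAGGG"
print(normalize(repeat_ref, 8, "CAC", "C"))
print(normalize(repeat_ref, 9, "ACA", "A"))
```

Both calls converge to the same left-aligned record, which is why normalization is sufficient for equivalent singletons. It works on one variant at a time, so it cannot reconcile one variant against several, or several against several.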

VarMatch Algorithm Overview

• Separator on the reference genome sequence

  • Variants on the left cannot be equivalent to variants on the right

• Linear scan of the reference genome to identify separators

• Solve independent small problems

  • Branch and bound method for each small problem

• Similar algorithm to Cleary et al., 2015

  • Problem size is small

  • Requires less memory and time

• Theorem for identifying separators
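To make the small problem concrete, the sketch below solves it for one region between two separators. It is not VarMatch's implementation: it swaps in plain exhaustive enumeration of subsets where the slides name branch and bound, it treats the genome as haploid, and it maximizes the total number of matched variants. All function names are hypothetical. The running time is exponential in the number of variants, which is tolerable only because separators keep each subproblem small.

```python
from itertools import combinations


def apply_variants(reference, variants):
    """Return the donor sequence, or None if variants overlap or REF mismatches."""
    donor = reference
    prev_start = len(reference) + 1  # 1-based start of the last applied variant
    for pos, ref, alt in sorted(variants, reverse=True):
        end = pos + len(ref) - 1
        if end >= prev_start or reference[pos - 1:end] != ref:
            return None
        donor = donor[:pos - 1] + alt + donor[end:]
        prev_start = pos
    return donor


def donors_by_subset(reference, variants):
    """Map each reachable donor sequence to the largest subset producing it."""
    best = {}
    for k in range(1, len(variants) + 1):  # increasing size: later wins
        for subset in combinations(variants, k):
            donor = apply_variants(reference, subset)
            if donor is not None:
                best[donor] = subset
    return best


def best_match(reference, calls_a, calls_b):
    """Exhaustively find the largest pair of equivalent subsets."""
    a = donors_by_subset(reference, calls_a)
    b = donors_by_subset(reference, calls_b)
    shared = sorted(set(a) & set(b))
    if not shared:
        return (), ()
    donor = max(shared, key=lambda d: len(a[d]) + len(b[d]))
    return a[donor], b[donor]


reference = "AGCCGG"
# Positions are illustrative assumptions; the last call in calls_a has no partner.
calls_a = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A"), (6, "G", "T")]
calls_b = [(1, "AG", "A"), (5, "G", "GA")]

matched_a, matched_b = best_match(reference, calls_a, calls_b)
print(matched_a)
print(matched_b)
```

Here three SNPs from the first call set are matched against the deletion and insertion in the second, and the unpartnered SNP at position 6 is left out.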

Software: https://github.com/medvedevgroup/varmatch

Preprint: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)



VarMatch supports flexible scoring schemes

• Maximize the total number of matched variants, or only those in the baseline?

• Maximize the number of calls or the total edit distance?

  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base

• Require genotypes to match, or just detect that a variant is present?

Others possible?
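The difference between counting calls and summing edit distance can be seen on the 10-base example. In the sketch below, the scheme names `"count"` and `"edit_distance"` are illustrative labels, not VarMatch command-line options, and using Levenshtein distance between REF and ALT as the per-variant edit distance is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def score(matched, scheme="count"):
    """Score a set of matched (pos, ref, alt) variants under a scheme."""
    if scheme == "count":
        return len(matched)
    if scheme == "edit_distance":
        return sum(edit_distance(ref, alt) for _pos, ref, alt in matched)
    raise ValueError(f"unknown scheme: {scheme}")


one_call = [(1, "A" * 10, "C" * 10)]                # one call changing 10 bases
ten_calls = [(i, "A", "C") for i in range(1, 11)]   # ten calls changing 1 base each

for scheme in ("count", "edit_distance"):
    print(scheme, score(one_call, scheme), score(ten_calls, scheme))
```

Under the count scheme the ten single-base calls outweigh the single long call by 10 to 1; under the edit-distance scheme the two are weighted equally.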


Benchmark

Number of Matched Variants

                CHM1 + bowtie (Li 14)       NA12878 + bowtie (Li 14)
                Freebayes     GATK-HC       Freebayes     GATK-UG
Vt normalize    2,778,372     2,778,372     4,092,161     4,092,161
RTG Tools       2,843,396     2,912,641     4,197,070     4,321,997
VarMatch        2,843,396     2,912,641     4,197,138     4,322,083

Memory and Running Time Evaluation

                RAM (GB)      Time (s)
RTG Tools       48            456
VarMatch        5             302

Matching in low-complexity regions

• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 14)

  • Using Bowtie2+GATK as baseline

• Focus on low-complexity regions

• 12% more equivalent variants identified using VarMatch than using normalization

[Figure panels: Results of Vt-normalize; Results of VarMatch]

Matching in dense regions

• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 14)

  • Using GIAB Gold Standard (Zook et al 14) as baseline

• Focus on “dense regions”

  • 10-base regions that contain an INDEL and another variant

• The genome-wide assessment differs from the assessment in dense regions: Freebayes matches more baseline variants genome wide, Platypus more in dense regions

Number of Matched Variants in Baseline

                 Freebayes     Platypus
genome wide      2,896,841     2,891,849
dense regions    24,188        24,522
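One way to pick out variants in dense regions is sketched below. The slides define a dense region only as a 10-base region containing an INDEL and another variant; reading that as "two variants whose start positions are fewer than 10 bases apart, at least one of them an INDEL" is an assumption, and `dense_variants` is a hypothetical helper rather than the procedure used for the table above.

```python
def is_indel(ref, alt):
    """A variant is treated as an INDEL when REF and ALT differ in length."""
    return len(ref) != len(alt)


def dense_variants(variants, window=10):
    """Return variants that share a `window`-base region with another
    variant, where at least one of the two is an INDEL."""
    ordered = sorted(variants)
    dense = set()
    for i, (pos_i, ref_i, alt_i) in enumerate(ordered):
        for pos_j, ref_j, alt_j in ordered[i + 1:]:
            if pos_j - pos_i >= window:
                break  # later variants are even farther away
            if is_indel(ref_i, alt_i) or is_indel(ref_j, alt_j):
                dense.add((pos_i, ref_i, alt_i))
                dense.add((pos_j, ref_j, alt_j))
    return sorted(dense)


# Hypothetical calls on one chromosome, as (pos, ref, alt).
calls = [
    (100, "A", "T"), (105, "AG", "A"), (112, "C", "G"),  # cluster around an INDEL
    (200, "C", "T"), (204, "G", "A"),                    # two nearby SNPs, no INDEL
    (300, "T", "TAA"),                                   # isolated INDEL
]
print(dense_variants(calls))
```

Only the first cluster is reported: the two nearby SNPs lack an INDEL, and the isolated insertion has no neighbor within the window.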

Conclusion

• Software: https://github.com/medvedevgroup/varmatch

• Manuscript: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)



Supplementary


VarMatch Highlights

• Uses less memory and running time

• Better performance matching complex variants

• Better performance in low-complexity regions

• Better performance in dense regions

• Flexible scoring schemes


