giab sep2016 lightning chen sun varmatch

VarMatch: robust matching of small variant datasets using flexible scoring schemes

Chen Sun, Paul Medvedev

Penn State

Variant Matching

• Different pipelines tend to report the same variants in different representations

• Need to compare VCF files
  • Evaluate variant callers
  • Find the overlap as high-confidence variants
  • Add variants into a database

• Two variant sets are equivalent if applying them separately to the reference genome results in the same donor genome.

• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.

The Variant Matching problem

Seq      A G C C G G
1  REF   G C C G
   ALT   C C G A
2  REF   G C G
   ALT   C G A
3  REF   A G G
   ALT   A G A
Donor:   A C C G A G
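The equivalence definition can be checked directly on this example. The sketch below is an illustration, not VarMatch code: `apply_variants` is a hypothetical helper, positions are 1-based, and the grouping of the flattened REF/ALT letters into individual variants (one complex variant for set 1, three SNPs for set 2, a deletion plus an insertion for set 3) is an assumed reading of the slide figure.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (1-based pos) to a reference string."""
    pieces = []
    cursor = 0  # index of the first reference base not yet copied
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # unchanged bases before the variant
        pieces.append(alt)                      # substitute the ALT allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged tail
    return "".join(pieces)


REFERENCE = "AGCCGG"

# Assumed reading of the three call sets in the figure.
set_1 = [(2, "GCCG", "CCGA")]                       # one complex variant
set_2 = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]  # three SNPs
set_3 = [(1, "AG", "A"), (5, "G", "GA")]            # a deletion and an insertion

donors = [apply_variants(REFERENCE, s) for s in (set_1, set_2, set_3)]
print(donors)  # ['ACCGAG', 'ACCGAG', 'ACCGAG']
```

Under this reading all three sets produce the donor shown on the slide, so they are equivalent even though no two of them share a position-and-allele record.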

• Naïve approach
  • Match two variants only if location and alleles match exactly

• Normalization (Tan et al. 2015)
  • Guaranteed to match equivalent singletons

• Complex variants
  • One variant matches multiple variants
  • Multiple variants match multiple variants

• Decomposition (Li 2014, Zook et al. 2014)
  • Creates fractional matches
  • Does not always work (Example 3)
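To make the normalization bullet concrete, here is a minimal sketch of the usual trim-and-left-align procedure for a single biallelic variant. It is written from the published description of normalization rather than taken from vt or VarMatch, and the function name `normalize` and the toy reference are assumptions for illustration.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant; pos is 1-based."""
    start = pos - 1
    # Step 1: drop shared trailing bases, extending one base to the left
    # whenever an allele would otherwise become empty.
    while ref[-1] == alt[-1]:
        if min(len(ref), len(alt)) == 1:
            if start == 0:
                break  # cannot extend past the start of the reference
            start -= 1
            base = reference[start]
            ref, alt = base + ref, base + alt
        ref, alt = ref[:-1], alt[:-1]
    # Step 2: drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    return start + 1, ref, alt


reference = "ATCACACAG"
# Two representations of the same "CA" deletion inside the CA repeat.
print(normalize(reference, 6, "ACA", "A"))  # (2, 'TCA', 'T')
print(normalize(reference, 3, "CAC", "C"))  # (2, 'TCA', 'T')
```

After normalization the two records are identical, so exact matching finds them. This handles one variant at a time, which is why it does not help when one variant is equivalent to several variants in the other call set.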

VarMatch Algorithm Overview

• Separator on reference genome sequence
  • Variants on the left cannot be equivalent to variants on the right

• Linear scan of reference genome to identify separators
• Solve independent small problems
• Branch and bound method for each small problem

• Similar algorithm to Cleary et al., 2015
  • Problem size small
  • Requires less memory and time

• Theorem for identifying separators

Software: https://github.com/medvedevgroup/varmatch

Preprint: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)
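Once separators split the genome into small independent problems, each one can in principle be solved by search. The sketch below uses plain exhaustive enumeration of subsets, not the branch and bound method the slides refer to, and it ignores genotypes. All names (`apply_variants`, `nonempty_subsets`, `best_match`) are hypothetical, and the example call sets are an assumed reading of the earlier figure.

```python
from itertools import combinations


def apply_variants(reference, variants):
    """Return the donor sequence, or None if variants overlap or disagree with the reference."""
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):  # pos is 1-based
        start = pos - 1
        if start < cursor or reference[start:start + len(ref)] != ref:
            return None
        pieces += [reference[cursor:start], alt]
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def nonempty_subsets(variants):
    for size in range(1, len(variants) + 1):
        yield from combinations(variants, size)


def best_match(reference, baseline, query):
    """Largest pair of subsets (by total variant count) that yield the same donor."""
    # For every donor reachable from the baseline, remember its largest subset.
    by_donor = {}
    for subset in nonempty_subsets(baseline):
        donor = apply_variants(reference, subset)
        if donor is not None and len(subset) > len(by_donor.get(donor, ())):
            by_donor[donor] = subset
    best = ((), ())
    for subset in nonempty_subsets(query):
        partner = by_donor.get(apply_variants(reference, subset))
        if partner and len(partner) + len(subset) > len(best[0]) + len(best[1]):
            best = (partner, subset)
    return best


baseline = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]
query = [(1, "AG", "A"), (5, "G", "GA")]
matched_baseline, matched_query = best_match("AGCCGG", baseline, query)
print(len(matched_baseline), len(matched_query))  # 3 2
```

Enumeration is exponential in the number of variants, which is only tolerable because each subproblem between separators is small; a branch and bound search prunes this space further.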

VarMatch supports flexible scoring schemes

• Maximize the total number of matched variants, or only those in the baseline?

• Maximize the number of calls or the total edit distance?
  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base

• Require genotypes to match, or just detect that a variant is present?

Others possible?
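The first two choices can be expressed as a small scoring function. This is an illustrative sketch only: the function names and the `count_query` and `unit` parameters are assumptions and do not correspond to VarMatch command-line options, and genotype matching is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between two allele strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def weight(variants, unit):
    if unit == "calls":
        return len(variants)
    if unit == "edit":
        return sum(edit_distance(ref, alt) for _, ref, alt in variants)
    raise ValueError(f"unknown unit: {unit}")


def score(baseline_subset, query_subset, count_query=True, unit="calls"):
    """Score a matched pair of subsets under the chosen scheme."""
    total = weight(baseline_subset, unit)
    if count_query:
        total += weight(query_subset, unit)
    return total


# One call changing 10 bases vs. 10 calls each changing 1 base.
one_call = [(1, "A" * 10, "C" * 10)]
ten_calls = [(i, "A", "C") for i in range(1, 11)]

print(score(one_call, ten_calls))                                   # 11
print(score(one_call, ten_calls, count_query=False))                # 1
print(score(one_call, ten_calls, unit="edit"))                      # 20
print(score(one_call, ten_calls, count_query=False, unit="edit"))   # 10
```

Counting calls rewards the representation with many small records, while edit distance treats the two representations alike; which is preferable depends on what the comparison is meant to measure.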

Benchmark

Number of Matched Variants

                CHM1 + bowtie (Li 2014)     NA12878 + bowtie (Li 2014)
                Freebayes    GATK-HC        Freebayes    GATK-UG
Vt normalize    2,778,372    2,778,372      4,092,161    4,092,161
RTG Tools       2,843,396    2,912,641      4,197,070    4,321,997
VarMatch        2,843,396    2,912,641      4,197,138    4,322,083

Memory and Running Time Evaluation

                RAM (GB)    Time (s)
RTG Tools       48          456
VarMatch        5           302

Matching in low-complexity regions

• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 2014)
  • Using Bowtie2+GATK as baseline

• Focus on low-complexity regions

• 12% more equivalent variants identified using VarMatch than using normalization

[Figure panels: Results of Vt-normalize | Results of VarMatch]

Matching in dense regions

• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 2014)
  • Using GIAB Gold Standard (Zook et al. 2014) as baseline

• Focus on “dense regions”
  • 10-base regions that contain an INDEL and another variant

• Genome-wide assessment differs from assessment in dense regions
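The dense-region definition above can be sketched as a simple filter. The slides do not say exactly how the 10-base window is anchored, so this version assumes that an indel and another variant are "dense" when their start positions are fewer than 10 bases apart; `dense_variants` and its `window` parameter are hypothetical names, and the input is assumed to be the variants of a single chromosome.

```python
def is_indel(ref, alt):
    return len(ref) != len(alt)


def dense_variants(variants, window=10):
    """Variants that share a `window`-base region with an indel (or are that indel)."""
    ordered = sorted(variants)
    dense = set()
    for i, (pos, ref, alt) in enumerate(ordered):
        if not is_indel(ref, alt):
            continue
        for j, other in enumerate(ordered):
            # Quadratic scan kept for clarity; fine for a sketch.
            if j != i and abs(other[0] - pos) < window:
                dense.add(ordered[i])
                dense.add(other)
    return sorted(dense)


calls = [
    (100, "A", "AT"),  # insertion with a SNP 5 bases away: dense
    (105, "G", "C"),
    (300, "T", "G"),   # two nearby SNPs, no indel: not dense
    (305, "C", "A"),
    (500, "GA", "G"),  # isolated deletion: not dense
]
print(dense_variants(calls))  # [(100, 'A', 'AT'), (105, 'G', 'C')]
```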

Number of Matched Variants in Baseline

                 Freebayes    Platypus
genome wide      2,896,841    2,891,849
dense regions    24,188       24,522

Conclusion

• Software: https://github.com/medvedevgroup/varmatch

• Manuscript: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)

Supplementary

VarMatch Highlights

• Uses less memory and running time

• Better performance matching complex variants

• Better performance in low-complexity regions

• Better performance in dense regions

• Flexible scoring schemes
