giab sep2016 lightning chen sun varmatch

VarMatch: robust matching of small variant datasets using flexible scoring schemes. Chen Sun, Paul Medvedev, Penn State

Upload: genomeinabottle

Post on 17-Jan-2017


TRANSCRIPT

Page 1: GIAB Sep2016 Lightning chen sun varmatch

VarMatch: robust matching of small variant datasets using flexible scoring schemes

Chen Sun, Paul Medvedev

Penn State


Page 2: GIAB Sep2016 Lightning chen sun varmatch

Variant Matching

• Different pipelines tend to report the same variants in different representations

• Need to compare VCF files in order to:
  • Evaluate variant callers
  • Identify the overlap between call sets as high-confidence variants
  • Add variants to a database

• Two variant sets are equivalent if applying each of them separately to the reference genome yields the same donor genome.

• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.


Page 3: GIAB Sep2016 Lightning chen sun varmatch

The Variant Matching problem

Reference: A G C C G G

Call set 1: REF GCCG, ALT CCGA

Call set 2: REF G, C, G; ALT C, G, A

Call set 3: REF AG, G; ALT A, GA

Donor: A C C G A G
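The equivalence definition can be checked mechanically on this example. The following is a minimal Python sketch, not VarMatch code: the helper `apply_variants` is hypothetical, and the 0-based positions and the grouping of alleles into individual variants are one assumed reading of the slide, which does not state them explicitly.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a reference string."""
    donor = []
    cursor = 0  # first reference base not yet copied or replaced
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match the reference at {pos}")
        donor.append(reference[cursor:pos])  # unchanged bases before the variant
        donor.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    donor.append(reference[cursor:])         # unchanged tail
    return "".join(donor)


REFERENCE = "AGCCGG"
# Assumed positions and groupings (hypothetical reading of the slide).
CALL_SET_1 = [(1, "GCCG", "CCGA")]
CALL_SET_2 = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A")]
CALL_SET_3 = [(0, "AG", "A"), (4, "G", "GA")]

donors = {apply_variants(REFERENCE, calls)
          for calls in (CALL_SET_1, CALL_SET_2, CALL_SET_3)}
print(donors)  # a single donor string means the three call sets are equivalent
```

Under these assumptions all three call sets produce the same donor, even though no two of them share a single identical variant record.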

• Naïve approach
  • Match two variants only if their locations and alleles are exactly the same

• Normalization (Tan et al., 2015)
  • Guaranteed to match equivalent singleton variants

• Complex variants
  • One variant matches multiple variants
  • Multiple variants match multiple variants

• Decomposition (Li, 2014; Zook et al., 2014)
  • Creates fractional matches
  • Does not always work (see the example)
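To make the normalization step concrete, here is a minimal Python sketch of trimming and left-aligning a single biallelic variant, in the spirit of the procedure described by Tan et al. The function `normalize`, the 0-based coordinates, and the toy sequences are assumptions for illustration; this is not the vt implementation.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant; returns (pos, ref, alt), 0-based pos."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    # Step 1: drop shared trailing bases; when an allele becomes empty,
    # extend both alleles by one reference base to the left.
    while True:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            changed = True
        if not changed:
            break
    # Step 2: drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


REPEAT_REF = "TGCACACAG"
# Two representations of the same CA deletion inside a CA repeat.
print(normalize(REPEAT_REF, 5, "ACA", "A"))
print(normalize(REPEAT_REF, 1, "GCA", "G"))
# A complex variant is left unchanged, so normalization alone cannot
# match it against a set of several smaller variants.
print(normalize("AGCCGG", 1, "GCCG", "CCGA"))
```

The two deletion records converge to the same normalized form, while the complex variant stays as a single record. That is the gap that motivates matching sets of variants rather than individual records.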

Page 4: GIAB Sep2016 Lightning chen sun varmatch

VarMatch Algorithm Overview

• Separator on the reference genome sequence
  • Variants to the left of a separator cannot be equivalent to variants to its right

• Linear scan of the reference genome to identify separators
  • Solve each independent small problem
  • Branch-and-bound method for each small problem

• Similar algorithm to Cleary et al., 2015
  • Problem sizes are small
  • Requires less memory and time

• Theorem for identifying separators
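To illustrate what one "independent small problem" looks like, the sketch below solves a single cluster by exhaustive search. It is deliberately not the branch-and-bound method, and it does not implement separator detection: the input is assumed to be one region already cut out between two separators. All names (`apply_variants`, `best_match`) and the toy data are hypothetical.

```python
from itertools import combinations


def apply_variants(reference, variants):
    """Return the donor string, or None if the variants overlap or mismatch the reference."""
    donor, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            return None
        donor += [reference[cursor:pos], alt]
        cursor = pos + len(ref)
    return "".join(donor) + reference[cursor:]


def nonempty_subsets(variants):
    """Yield every non-empty subset, smallest first."""
    for size in range(1, len(variants) + 1):
        yield from combinations(variants, size)


def best_match(reference, baseline, query):
    """Largest pair of subsets (by total variant count) that yield the same donor."""
    # Index baseline subsets by donor; later (larger) subsets overwrite smaller ones.
    by_donor = {}
    for subset in nonempty_subsets(baseline):
        donor = apply_variants(reference, subset)
        if donor is not None and donor != reference:
            if donor not in by_donor or len(subset) > len(by_donor[donor]):
                by_donor[donor] = subset
    best = ((), ())
    for subset in nonempty_subsets(query):
        donor = apply_variants(reference, subset)
        if donor in by_donor:
            if len(subset) + len(by_donor[donor]) > len(best[0]) + len(best[1]):
                best = (by_donor[donor], subset)
    return best


# Toy cluster: the last query variant has no equivalent in the baseline.
baseline = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A")]
query = [(0, "AG", "A"), (4, "G", "GA"), (5, "G", "T")]
matched_baseline, matched_query = best_match("AGCCGG", baseline, query)
print(matched_baseline, matched_query)
```

The search is exponential in the number of variants per cluster, which is why keeping each cluster small matters. A branch-and-bound search prunes this space instead of enumerating it.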

Software: https://github.com/medvedevgroup/varmatch

Preprint: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)


Page 5: GIAB Sep2016 Lightning chen sun varmatch

VarMatch supports flexible scoring schemes

• Maximize the number of matched variants in total, or only in the baseline?

• Maximize the number of calls or the total edit distance?
  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base

• Require genotypes to match, or only detect that a variant is present?

Other schemes possible?
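The difference between counting calls and counting edited bases can be shown with two toy scoring functions. This is a sketch under stated assumptions: the function names are hypothetical, and `max(len(ref), len(alt))` is only a crude stand-in for edit distance, not the scoring VarMatch uses.

```python
def score_by_count(variants):
    """Score a matched set by its number of calls."""
    return len(variants)


def score_by_edit_size(variants):
    """Score a matched set by a rough edit size: the longer allele of each call."""
    return sum(max(len(ref), len(alt)) for _pos, ref, alt in variants)


# One call that changes 10 bases vs. 10 calls that each change 1 base.
one_long_call = [(100, "A" * 10, "C" * 10)]
ten_short_calls = [(200 + i, "A", "C") for i in range(10)]

for name, calls in (("one long call", one_long_call),
                    ("ten short calls", ten_short_calls)):
    print(name, score_by_count(calls), score_by_edit_size(calls))
```

Under the count-based score the ten short calls dominate, while under the size-based score the two sets tie. The choice of scheme can therefore change which matching is considered optimal.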


Page 6: GIAB Sep2016 Lightning chen sun varmatch

Benchmark

Number of Matched Variants

               CHM1 + bowtie (Li 14)      NA12878 + bowtie (Li 14)
               Freebayes    GATK-HC       Freebayes    GATK-UG
Vt normalize   2,778,372    2,778,372     4,092,161    4,092,161
RTG Tools      2,843,396    2,912,641     4,197,070    4,321,997
VarMatch       2,843,396    2,912,641     4,197,138    4,322,083

Memory and Running Time Evaluation

               RAM (GB)    Time (s)
RTG Tools      48          456
VarMatch       5           302

Page 7: GIAB Sep2016 Lightning chen sun varmatch

Matching in low-complexity regions

• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 call sets (Li, 2014)
  • Using Bowtie2+GATK as the baseline

• Focus on low-complexity regions

• VarMatch identifies 12% more equivalent variants than normalization

[Figure panels: Results of Vt-normalize; Results of VarMatch]

Page 8: GIAB Sep2016 Lightning chen sun varmatch

Matching in dense regions

• Comparison of Freebayes vs. Platypus NA12878 call sets (Li, 2014)
  • Using the GIAB gold standard (Zook et al., 2014) as the baseline

• Focus on "dense regions"
  • 10-base regions that contain an INDEL and another variant

• Genome-wide assessment differs from assessment in dense regions: Freebayes matches more baseline variants genome wide, Platypus more in dense regions

Number of Matched Variants in Baseline

                 Freebayes    Platypus
genome wide      2,896,841    2,891,849
dense regions    24,188       24,522
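The "dense region" definition can be turned into a simple filter. The sketch below is a hypothetical reading of it, not the procedure used for the table: a variant is flagged when an INDEL and at least one other variant start within 10 bases of each other on the same chromosome. The function names and the toy variants are assumptions.

```python
def is_indel(ref, alt):
    """An INDEL changes the allele length."""
    return len(ref) != len(alt)


def dense_region_variants(variants, window=10):
    """Return variants that share a `window`-base neighbourhood with an INDEL.

    `variants` is a list of (pos, ref, alt) tuples on one chromosome.
    The pairwise scan is quadratic; it is meant for illustration only.
    """
    dense = set()
    for i, (pos, ref, alt) in enumerate(variants):
        if not is_indel(ref, alt):
            continue
        for j, other in enumerate(variants):
            if i != j and abs(other[0] - pos) < window:
                dense.add(variants[i])
                dense.add(other)
    return sorted(dense)


toy_calls = [
    (100, "A", "AT"),   # insertion with an SNV 4 bases away: dense
    (104, "G", "C"),
    (300, "C", "T"),    # two nearby SNVs but no INDEL: not dense
    (305, "G", "A"),
    (500, "TA", "T"),   # isolated deletion: not dense
]
print(dense_region_variants(toy_calls))
```

Restricting an evaluation to the variants returned by such a filter is one way to obtain counts for dense regions separately from the genome-wide counts.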

Page 9: GIAB Sep2016 Lightning chen sun varmatch

Conclusion

• Software: https://github.com/medvedevgroup/varmatch

• Manuscript: VarMatch: robust matching of small variant datasets using flexible scoring schemes (bioRxiv)


Page 10: GIAB Sep2016 Lightning chen sun varmatch

Supplementary


Page 11: GIAB Sep2016 Lightning chen sun varmatch

VarMatch Highlights

• Uses less memory and running time

• Better performance matching complex variants

• Better performance in low-complexity regions

• Better performance in dense regions

• Flexible scoring schemes

