sept2016 sv nist_intro

SV Data Jamboree

Justin Zook and Ali Bashir

With the Genome in a Bottle Consortium

September 15, 2016

Sequencing technologies and bioinformatics pipelines disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Candidate NIST Reference Materials

Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Data for GIAB PGP Trios

Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly

Paper describing data…

• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab

Integration Methods to Establish Benchmark Small Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

How can we extend this approach to SVs?

Similarities to small variants

• Collect callsets from multiple technologies

• Compare callsets to find calls supported by multiple technologies

Differences from small variants

• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
– SVs of different types/sizes are not always detected easily

• Variants are often imprecisely characterized
– breakpoints, size, type, etc.

• Representation of variants is poorly standardized, especially when complex

• Comparison tools in infancy

Callsets Contributed so far

Short reads

• Illumina
– Spiral Genetics
– cortex
– CommonLaw
– MetaSV

• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta

Long reads and Linked reads

• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/assembly
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
– NHGRI assembly-based

• Nanopore mapping
– Nabsys force calls

• Optical mapping
– BioNano with and without haplotype-aware assembly

• 10X Genomics Chromium
– Deletions
– Large SVs

AJ Trio Assemblies

On FTP

• PacBio
– Falcon
– Canu

• BioNano
– Haploid
– Diploid

In Process

• Illumina
– DISCOVAR (contig N50 ~100k)

• PacBio
– Falcon diploid in process

• Dovetail scaffolding
– With PacBio-Falcon
– With PacBio-Canu
– With DISCOVAR

• 10X?
– By itself
– Phasing PacBio

APPROACH #1: FIND DELETIONS WITH SUPPORT FROM MULTIPLE TECHS AND CONCORDANT BREAKPOINTS

Step 1: Merging calls

• Process
– Find union of calls >19bp from all deletion callsets and merge any regions if within 1000 bp (results in 28460 regions)
– Annotate each merged region with fraction covered by calls from each callset
– Split out those overlapping tandem repeats longer than 200bp by >25% (2715 regions)

• Helps mitigate different representations of calls in repetitive regions and imprecision of breakpoints from many callers

• Limitations
– May not appropriately call compound heterozygous SVs
– Ignores other types of SVs in the region
– Loses genotype information
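The merging process above can be sketched roughly as follows. This is a minimal illustration, not the code used for the integration: the function names, the tuple layout `(chrom, start, end, callset)`, and the choice to treat "within 1000 bp" as a gap of at most 1000 bp between sorted calls are all assumptions.

```python
from collections import defaultdict

MERGE_DISTANCE = 1000  # bp, from the slide
MIN_SIZE = 20          # keep calls >19bp


def merge_calls(calls, max_gap=MERGE_DISTANCE, min_size=MIN_SIZE):
    """Merge deletion calls (chrom, start, end, callset) into candidate regions."""
    kept = sorted(c for c in calls if c[2] - c[1] >= min_size)
    regions = []
    for chrom, start, end, callset in kept:
        last = regions[-1] if regions else None
        # Assumption: a call joins the previous region if it starts within max_gap of it.
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            last["end"] = max(last["end"], end)
            last["calls"].append((start, end, callset))
        else:
            regions.append({"chrom": chrom, "start": start, "end": end,
                            "calls": [(start, end, callset)]})
    return regions


def fraction_covered(region):
    """Fraction of a merged region covered by calls from each callset."""
    length = region["end"] - region["start"]
    by_callset = defaultdict(list)
    for start, end, callset in region["calls"]:
        by_callset[callset].append((start, end))
    fractions = {}
    for callset, intervals in by_callset.items():
        covered, cursor = 0, region["start"]
        for start, end in sorted(intervals):
            start = max(start, cursor)  # do not count overlapping bases twice
            if end > start:
                covered += end - start
                cursor = end
        fractions[callset] = covered / length
    return fractions
```

Note that the sketch keeps only coordinates and callset names, which mirrors the limitation listed above: genotype information is lost at this step.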

[Figure: Callset #1 and Callset #2]

Step 2: Find size prediction accuracy

• Find “size prediction accuracy” of each callset by calculating the difference from the median predicted size for regions with calls from >3 callers, and rank callers separately for the <3kb and >3kb size ranges

Size <3kb

Spiral 0.00%
Cortex 0.24%
CGSV 0.65%
AssemblyticsFalcon 0.79%
CGvcf 1.09%
fermikit 1.28%
smrtsvdip 1.43%
MetaSV 1.57%
MultibreakSV 1.62%
PBHoneySpots 2.13%
AssemblyticsMHAP 2.21%
ParliamentAssemblyForce 2.26%
CSHLassembly 2.29%
ParliamentPacBio 2.92%
ParliamentAssembly 3.00%

Size >3kb

Spiral 0.04%
AssemblyticsFalcon 0.06%
CGSV 0.06%
CSHLassembly 0.08%
AssemblyticsMHAP 0.08%
MultibreakSV 0.10%
fermikit 0.11%
PBHoneyTails 0.38%
CommonLaw 0.48%
ParliamentPacBio 0.58%
smrtsvdip 0.62%
MetaSV 1.12%
sniffles 1.57%
Nabsys2tech01Force 3.02%
BioNano 3.67%

IMPORTANT NOTE: These stats are intended for integration and to help developers improve their methods, not to compare methods, since they likely do not reflect actual size prediction accuracy for all methods.
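One way the ranking in Step 2 could be computed is sketched below. The slide does not say how per-region differences are summarized into a single percentage, so taking the median relative deviation per callset is an assumption here, as are the function name and the input layout (one dict of `callset -> predicted size` per merged region).

```python
from collections import defaultdict
from statistics import median


def rank_callsets(regions, min_callers=4):
    """Rank callsets by how far their predicted sizes fall from the per-region median.

    regions: list of dicts mapping callset name -> predicted deletion size (bp).
    Only regions with calls from >3 callers (min_callers=4) are used.
    """
    deviations = {"<3kb": defaultdict(list), ">3kb": defaultdict(list)}
    for sizes in regions:
        if len(sizes) < min_callers:
            continue
        consensus = median(sizes.values())
        # Assumption: a consensus size of exactly 3000 bp goes in the larger bin.
        size_bin = "<3kb" if consensus < 3000 else ">3kb"
        for callset, size in sizes.items():
            deviations[size_bin][callset].append(abs(size - consensus) / consensus)
    # Assumption: summarize each callset by its median relative deviation.
    return {
        size_bin: sorted((median(devs), callset) for callset, devs in per_callset.items())
        for size_bin, per_callset in deviations.items()
    }
```

The best-ranked callset in each size range (first element of each list) would then supply the representative call for a region.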

Step 3: Find calls supported by 2 techs

1. Find calls supported by calls from 2 or more technologies with size prediction within 20%

2. Find sensitivity of each caller to these calls in size ranges 20-50, 50-100, 100-1000, 1000-3000, and >3000 bp
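The two steps above might look like this in code. It is a sketch under stated assumptions: "size prediction within 20%" is taken relative to the larger of the two sizes, the >3000 bp bin is treated as including exactly 3000 bp, and the function names and input layouts are hypothetical.

```python
from itertools import combinations

SIZE_BINS = [(20, 50), (50, 100), (100, 1000), (1000, 3000), (3000, None)]


def sizes_agree(a, b, tolerance=0.20):
    """True if two predicted sizes differ by at most 20% of the larger one."""
    return abs(a - b) <= tolerance * max(a, b)


def supported_by_two_techs(calls, tolerance=0.20):
    """calls: list of (technology, callset, size) within one merged region."""
    for (tech_a, _, size_a), (tech_b, _, size_b) in combinations(calls, 2):
        if tech_a != tech_b and sizes_agree(size_a, size_b, tolerance):
            return True
    return False


def sensitivity_by_bin(benchmark, callset):
    """benchmark: list of (size, set of callsets that found the call)."""
    result = {}
    for low, high in SIZE_BINS:
        in_bin = [found for size, found in benchmark
                  if size >= low and (high is None or size < high)]
        if in_bin:  # skip size ranges with no supported calls
            result[(low, high)] = sum(callset in found for found in in_bin) / len(in_bin)
    return result
```

Two callsets from the same technology do not count as independent support in this sketch, which is why the technology label is compared rather than the callset name.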

Step 4: Filter questionable calls supported by 2+ technologies

• 316 calls covered >25% by segmental duplication >10kb

• 631 calls with at least one caller predicting a size >2x different from the consensus size

• 34 calls that are missing from callsets of multiple technologies, where the product of (1 - sensitivity) across those missing callsets is < 2% in this size tranche

• 87 calls that overlap Ns in the reference
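The (1 - sensitivity) filter can be illustrated with a short sketch. It assumes the callsets miss calls independently, so that multiplying (1 - sensitivity) gives the chance that all of them miss a real call; the function names and the `(technology, sensitivity)` input layout are hypothetical.

```python
from math import prod


def miss_probability(sensitivities):
    """Chance that every listed callset misses a real call, assuming independence."""
    return prod(1.0 - s for s in sensitivities)


def is_questionable(missing, threshold=0.02):
    """missing: list of (technology, sensitivity) for callsets lacking the call.

    Sensitivities are those measured for the call's size tranche.
    """
    technologies = {tech for tech, _ in missing}
    if len(technologies) < 2:  # the filter applies to misses from multiple technologies
        return False
    return miss_probability(s for _, s in missing) < threshold
```

For example, two callsets from different technologies that each have 90% sensitivity in a tranche would both miss a real call only about 1% of the time, so a call missing from both would be flagged.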

Overview of process

1. Merge deletions within 1kb

2. Rank calls by closeness of predicted size to median size and select call in each region from best callset

3. Find calls supported by 2+ technologies with size within 20%

4. Filter calls overlapping seg dups, reference N’s, or with call with predicted size 2x larger

Number of Calls Supported by 2 Technologies by Size Range

 | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb
pre-filtered | 2542 | 1567 | 2447 | 731 | 730
filtered | 2427 | 1415 | 2207 | 638 | 524
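As a quick check on how much each size range is affected by filtering, the counts above can be turned into retained fractions. The variable names are illustrative; the numbers are taken directly from the table.

```python
bins = ["<50bp", "50-100bp", "100-1000bp", "1kb-3kb", ">3kb"]
pre_filtered = [2542, 1567, 2447, 731, 730]
filtered = [2427, 1415, 2207, 638, 524]

# Fraction of calls in each size range that survive the Step 4 filters.
retained = {b: f / p for b, p, f in zip(bins, pre_filtered, filtered)}

total_pre = sum(pre_filtered)
total_filtered = sum(filtered)

for size_bin, fraction in retained.items():
    print(f"{size_bin}: {fraction:.1%} retained")
print(f"all sizes: {total_filtered} of {total_pre} calls retained")
```

The >3kb range loses the largest share of calls, at roughly 28%.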

Size distributions

Support for all candidate regions

[Figure: support by # of callsets and by # of technologies]

Support for benchmark calls

[Figure: support by # of callsets and by # of technologies]

Approach #2: svcompare (NCBI hackathon)

Builds on SURVIVOR

• Compares each new callset to the first and adds new calls not within 1kb of existing calls

• Outputs multi-sample vcf with type, size, and breakpoints from each callset in each candidate region

• Integrates multiple types, but doesn’t currently output size of insertions or exact sequence

• Developed by Fritz Sedlazeck, JHU
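The incremental comparison described above can be sketched as follows. This is not the svcompare or SURVIVOR code or API: the function names, the data layout, and the interpretation of "within 1kb" as both breakpoints lying within 1 kb of an existing candidate are assumptions made for illustration.

```python
def add_callset(candidates, name, calls, max_dist=1000):
    """Add one callset to the running list of candidate regions.

    calls: list of (chrom, start, end, sv_type).
    A call that is not within max_dist of an existing candidate becomes a new one.
    """
    for chrom, start, end, sv_type in calls:
        match = next((c for c in candidates
                      if c["chrom"] == chrom
                      and abs(c["start"] - start) <= max_dist
                      and abs(c["end"] - end) <= max_dist), None)
        if match is None:
            match = {"chrom": chrom, "start": start, "end": end, "support": {}}
            candidates.append(match)
        # Keep type, size, and breakpoints from each callset, as in the multi-sample output.
        match["support"].setdefault(name, []).append((sv_type, end - start, start, end))
    return candidates


def discordant_type_regions(candidates):
    """Candidate regions where callsets report more than one SV type."""
    return [c for c in candidates
            if len({rec[0] for recs in c["support"].values() for rec in recs}) > 1]
```

Regions returned by `discordant_type_regions` correspond to the open question of how to integrate discordant types in the same region.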

Output stats

• 130k input regions from calls >19bp

• 876 regions have >1 type within a callset

• 2276 regions have >1 type across callsets

• How to integrate discordant types in same region?

https://github.com/NCBI-Hackathons/svcompare

Example start position distance from median start by callset (400-1000bp)

Approach #3: “Type” candidate calls in each dataset

svviz

• Looks for whether reads support REF or ALT allele
– Can often easily infer genotype

• Also generates other stats about mapping reads

• Generates visualization of mapped reads as well

• Nabsys has developed a similar approach for their mapping data
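To illustrate how a genotype can often be inferred from REF/ALT read support, here is a minimal sketch. It does not use svviz itself; the function name and the thresholds (`min_reads`, `het_low`, `het_high`) are illustrative assumptions, not values from the talk.

```python
def infer_genotype(ref_reads, alt_reads, min_reads=8, het_low=0.2, het_high=0.8):
    """Guess a genotype from counts of reads supporting the REF and ALT alleles."""
    total = ref_reads + alt_reads
    if total < min_reads:
        return "./."  # too little evidence to call
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        return "0/0"  # homozygous reference
    if alt_fraction > het_high:
        return "1/1"  # homozygous variant
    return "0/1"      # heterozygous
```

Usage: `infer_genotype(10, 12)` gives a heterozygous call, while `infer_genotype(2, 1)` is left uncalled because too few reads are informative.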

Compatible datasets

• PacBio

• Illumina 150bp and 250bp paired end

• Illumina 6kb mate-pair

• 10X haplotype-separated

10X SV analyses with svviz

• Find reads supporting ref and alt alleles in each haplotype

• Verify support for ref and alt is on different haplotypes for hets

• Verify support from both haplotypes for a confident hom var or hom ref call
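The haplotype checks above could be expressed as in the sketch below. It assumes per-haplotype counts of reads supporting REF and ALT are already available; the function names and the thresholds (`min_reads`, `min_fraction`) are hypothetical.

```python
def haplotype_allele(ref_reads, alt_reads, min_reads=3, min_fraction=0.8):
    """Return 'ref' or 'alt' if one allele clearly dominates on this haplotype."""
    total = ref_reads + alt_reads
    if total < min_reads:
        return None  # not enough reads on this haplotype
    if alt_reads / total >= min_fraction:
        return "alt"
    if ref_reads / total >= min_fraction:
        return "ref"
    return None      # mixed support within one haplotype


def check_haplotype_support(genotype, hap1, hap2):
    """hap1, hap2: (ref_reads, alt_reads) for each haplotype."""
    alleles = (haplotype_allele(*hap1), haplotype_allele(*hap2))
    if None in alleles:
        return False
    if genotype == "0/1":  # het: ref and alt must sit on different haplotypes
        return set(alleles) == {"ref", "alt"}
    if genotype == "1/1":  # hom var: both haplotypes support alt
        return alleles == ("alt", "alt")
    if genotype == "0/0":  # hom ref: both haplotypes support ref
        return alleles == ("ref", "ref")
    return False
```

A homozygous call with reads on only one haplotype fails the check, reflecting the requirement for support from both haplotypes.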

[Figure panels: Son, Dad, Mom]

Goals for Data Jamboree

Share progress in algorithm development

• New technologies

• New analysis methods

• Visualization methods

• Integration/comparison methods

Outstanding questions to discuss

• Integration
– How to form high-confidence calls, breakpoints, and genotypes from multiple calls?
– What is the minimum viable product for a practical benchmark set?
  • Is this a good criterion: “When an individual callset is compared to ours, most FPs/FNs should be errors in the individual callset”
– How to handle non-deletions?

• SV typing

• Future work
– How to form high-confidence regions?
– SV phasing
– Is anyone developing SV benchmarking tools?

Things to resolve

Integration

• How to compare events with variable breakpoints across callsets?
– Tandem repeats

• How to compare non-deletions?
– Start with insertions?

• Distinguish precise breakpoints when possible

Typing

• Leverage long-range information to type with short reads?

• How to deal with imprecise breakpoints?

• At what point is something validated?
– Potentially high-confidence variants (or reference?)
– Haplotype-separated

Acknowledgements

• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Hemang Parikh

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA
– Liz Mansfield

• SV Callset Contributors
– CSHL/JHU
– Mt Sinai
– 10X
– Nabsys
– Spiral Genetics/Stanford
– Heng Li/Mike Lin
– DNAnexus
– Complete Genomics
– Baylor
– Bina/Roche
– BioNano Genomics
– Mark Chaisson
– NIH/NCBI
– NIH/NHGRI
– Can Alkan/Stanford
