2014 agbt giab data integration poster 140206


Upload: genomeinabottle

Post on 24-Jun-2015


TRANSCRIPT

Page 1: 2014 agbt giab data integration poster 140206

SNPs indels

Genome in a Bottle Consortium

• As sequencing moves to clinical applications, assessing accuracy becomes very important.

• With the Genome in a Bottle Consortium, NIST is developing methods to characterize whole genome Reference Materials that can be used to assess the performance of whole genome sequencing.

• Data from multiple sequencing platforms and runs can be used to understand and compensate for the errors and biases of each method.

• We propose a method using 14 datasets for CEPH/HapMap sample NA12878 to find characteristics of highly confident genotype calls and use these characteristics to arbitrate between discordant calls.

Genome in a Bottle: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls

Justin Zook¹, Brad Chapman², Oliver Hofmann², Winston Hide², Jason Wang³, David Mittelman³, Marc Salit¹

¹National Institute of Standards and Technology, Gaithersburg, MD; ²Harvard School of Public Health, Cambridge, MA; ³Arpeggi, Inc., Austin, TX

[Figure: sequencing workflow diagram with the labels Samples, Sample Preparation, Sequencing, Bioinformatics, and Variant list / Performance metrics]

NA12878 Data sets

Performance assessment using integrated calls

• Calls hosted on GCAT website

• www.bioplanet.com/gcat

• Interactive comparison of bioinformatics methods to our integrated calls

• Using microarrays to assess performance underestimates the false-negative (FN) rate

• Integrated calls have a >20x higher percentage of low complexity regions than microarrays

• Freebayes has significantly improved its indel calls over the past year:

Discussion

• Genome in a Bottle Consortium

• New members welcome!

• www.genomeinabottle.org

Spike-ins

Integrating SNPs & indels

[Figure caption] Overlap of SNP calls for NA12878 between three variant call files.

(a) The three variant calls come from: (1) Illumina HiSeq reads mapped with bwa and with variants called by GATK; (2) the same Illumina HiSeq reads mapped with bwa but with variants called by samtools; (3) Complete Genomics called with CGTools 2.0.

(b) The samtools calls are replaced by SOLiD 4 reads called with GATK. The gray numbers in parentheses are the numbers of variants that are not filtered in the other datasets.

Characteristics of bias used for arbitration

• Systematic sequencing errors (SSEs)
  • Strand bias
  • Base Quality Rank Sum

• Local Alignment
  • Distance from end of read
  • Mean position within read
  • Read Position Rank Sum
  • HaplotypeScore
  • Length of aligned reads

• Mapping problems
  • Mapping Quality
  • Abnormal coverage – CNV
  • Length of aligned reads

• Abnormal allele balance
  • Allele Balance
  • Quality/Depth
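As a rough illustration of how a few of the characteristics above could be turned into per-dataset bias flags, here is a minimal Python sketch. The `SiteEvidence` layout, the function names, and every cutoff are hypothetical and chosen only for the example; the workflow on this poster lists VQSR steps rather than fixed thresholds, so treat the cutoffs as a stand-in.

```python
from dataclasses import dataclass


@dataclass
class SiteEvidence:
    """Read counts behind one heterozygous call in one dataset (hypothetical layout)."""
    ref_forward: int
    ref_reverse: int
    alt_forward: int
    alt_reverse: int
    mean_mapping_quality: float


def allele_balance(ev):
    """Fraction of all reads at the site that support the alternate allele."""
    total = ev.ref_forward + ev.ref_reverse + ev.alt_forward + ev.alt_reverse
    alt = ev.alt_forward + ev.alt_reverse
    return alt / total if total else 0.0


def alt_forward_fraction(ev):
    """Fraction of alternate-allele reads on the forward strand."""
    alt = ev.alt_forward + ev.alt_reverse
    return ev.alt_forward / alt if alt else 0.5


def bias_flags(ev, ab_range=(0.3, 0.7), strand_range=(0.1, 0.9), min_mapq=30.0):
    """Flag three kinds of bias using illustrative, assumed cutoffs."""
    ab = allele_balance(ev)
    fwd = alt_forward_fraction(ev)
    return {
        "abnormal_allele_balance": not (ab_range[0] <= ab <= ab_range[1]),
        "strand_bias": not (strand_range[0] <= fwd <= strand_range[1]),
        "mapping_problem": ev.mean_mapping_quality < min_mapq,
    }


# Made-up counts: one well-behaved site and one with several warning signs.
clean = SiteEvidence(12, 11, 10, 13, 58.0)
skewed = SiteEvidence(20, 18, 6, 0, 22.0)
print(bias_flags(clean))
print(bias_flags(skewed))
```

A dataset whose evidence raises any flag at a site would then be treated as showing evidence of bias at that site.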

Complete Genomics

Illumina HiSeq

Performance Assessment

• Within “highly confident” regions, all datasets are highly sensitive and specific

• Most “false” positives and negatives appear to be microarray errors
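One way to make "sensitive and specific within highly confident regions" concrete is to restrict both call sets to those regions and count matches. The sketch below is an assumption-laden simplification: variants are plain `(chrom, pos, ref, alt)` tuples, regions are half-open intervals, the function names are hypothetical, and exact tuple matching will miscount variants that have several valid representations.

```python
def in_regions(variant, regions):
    """True if the variant position falls in any (chrom, start, end) half-open region."""
    chrom, pos = variant[0], variant[1]
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def assess(test_calls, benchmark_calls, confident_regions):
    """Count TP/FP/FN inside the confident regions and derive two summary rates."""
    test = {v for v in test_calls if in_regions(v, confident_regions)}
    truth = {v for v in benchmark_calls if in_regions(v, confident_regions)}
    tp = len(test & truth)
    fp = len(test - truth)
    fn = len(truth - test)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "precision": precision}


# Made-up example data; the last call in each list lies outside the confident region.
regions = [("chr1", 100, 200)]
benchmark = [("chr1", 110, "A", "G"), ("chr1", 150, "C", "T"),
             ("chr1", 180, "G", "GA"), ("chr1", 250, "T", "C")]
test = [("chr1", 110, "A", "G"), ("chr1", 150, "C", "T"),
        ("chr1", 160, "G", "A"), ("chr1", 300, "A", "T")]
result = assess(test, benchmark, regions)
print(result)
```

Calls outside the confident regions are ignored on both sides, so they count neither for nor against the method being assessed.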

[Figure: workflow for integrating SNP and indel calls]

• Dataset #1 … Dataset #14: UnifiedGenotyper and HaplotypeCaller calls for each dataset, plus Cortex

• Candidate SNP & indel sites

• For each dataset: force calls with UnifiedGenotyper and force de novo assembly with HaplotypeCaller

• Integrate UG and HC calls for dataset #1 … dataset #11

• Find high-confidence SNP & indel sites

• For each dataset: HomRef SNP VQSR, Het SNP VQSR, HomVar SNP VQSR, HomRef indel VQSR, Het indel VQSR, HomVar indel VQSR

Arbitrate using characteristics of mapping and alignment bias and systematic sequencing errors to find consensus SNP & indel sites

Indels/Complex Variants

• Multiple correct representations of complex variants often exist

• Comparing complex variants is difficult. Try RTG’s vcfeval!
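To see why multiple correct representations exist, the Python sketch below applies two different variant lists to the same reference and checks that they yield the same haplotype. It uses the CAGTGA > TCTCT example from this poster; the flanking bases, the 0-based `(pos, ref, alt)` tuples, and the helper name `apply_variants` are assumptions for illustration. It shows only the general idea of comparing at the haplotype level and is not a description of how vcfeval is implemented.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a reference string."""
    haplotype = reference
    # Work right to left so earlier edits never shift the positions still to be applied.
    for pos, ref, alt in sorted(variants, reverse=True):
        if haplotype[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        haplotype = haplotype[:pos] + alt + haplotype[pos + len(ref):]
    return haplotype


# Hypothetical flanking sequence "GG" and "TT" around the poster's example.
reference = "GG" + "CAGTGA" + "TT"

# Representation 1: a single complex variant.
as_complex = [(2, "CAGTGA", "TCTCT")]

# Representation 2: a 1-base deletion (with an anchor base) plus four SNPs.
as_primitives = [(1, "GC", "G"), (3, "A", "T"), (4, "G", "C"),
                 (6, "G", "C"), (7, "A", "T")]

hap1 = apply_variants(reference, as_complex)
hap2 = apply_variants(reference, as_primitives)
print(hap1, hap2, hap1 == hap2)
```

A record-by-record comparison would report no shared variants between the two lists, even though both describe the same sequence.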

Filter sites if <2 datasets are free of bias
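The filtering rule above can be sketched as follows, assuming each dataset has already been given a genotype and a yes/no bias flag at the site. The tuple layout and the function name `arbitrate` are hypothetical, and filtering a site when the unbiased datasets disagree is an added assumption, not something stated on the poster.

```python
def arbitrate(calls, min_unbiased=2):
    """Pick a consensus genotype from (dataset, genotype, has_bias) tuples.

    Returns (genotype, reason); genotype is None when the site is filtered.
    """
    unbiased = [genotype for _, genotype, has_bias in calls if not has_bias]
    if len(unbiased) < min_unbiased:
        return None, "filtered: too few datasets free of bias"
    if len(set(unbiased)) > 1:
        # Assumed rule: remaining disagreement means no confident call.
        return None, "filtered: unbiased datasets disagree"
    return unbiased[0], "consensus"


# Made-up calls; the dataset labels are platforms named on this poster.
site_a = [("Illumina HiSeq", "0/1", False),
          ("Complete Genomics", "0/1", False),
          ("SOLiD 4", "0/0", True)]
site_b = [("Illumina HiSeq", "0/1", True),
          ("Complete Genomics", "0/1", False),
          ("SOLiD 4", "0/0", True)]
site_c = [("Illumina HiSeq", "0/1", False),
          ("Complete Genomics", "1/1", False)]

for site in (site_a, site_b, site_c):
    print(arbitrate(site))
```

At `site_a` the discordant call comes from a dataset flagged as biased, so the two unbiased datasets decide the genotype; `site_b` is filtered because only one dataset is free of bias.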


a http://genomeinabottle.org/blog-entry/existing-and-future-na12878-datasets.

Pedigree Methods

• Real Time Genomics and Illumina Platinum Genomes have developed methods to use the 11 children of NA12878

• High-confidence variants are in haplotypes that are properly inherited in the children
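A much simpler relative of the pedigree idea is a per-site Mendelian consistency check: every child genotype must be formed from one allele of each parent. The Python sketch below works on unphased genotypes at a single site, whereas the methods named above work on inherited haplotypes, so it is only a loose illustration. The function names and the example genotypes are hypothetical.

```python
from itertools import product


def consistent_with_parents(child, mother, father):
    """True if the unphased child genotype can be built from one allele of each parent."""
    return any(sorted((m, f)) == sorted(child) for m, f in product(mother, father))


def fraction_consistent(mother, father, children):
    """Share of children whose genotype is consistent with both parents at this site."""
    if not children:
        raise ValueError("at least one child genotype is required")
    ok = sum(consistent_with_parents(c, mother, father) for c in children)
    return ok / len(children)


# Made-up genotypes at one site: alleles are coded 0 (reference) and 1 (alternate).
mother = (0, 1)
father = (0, 0)
children = [(0, 1), (0, 0), (0, 1), (1, 1)]

share = fraction_consistent(mother, father, children)
print(share)
```

Here the last child carries two alternate alleles although the father has none, so the site would look suspicious in at least one family member.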

Structural Variants

• Can we use similar methods for SVs?

• Arbitrate using coverage, insert size, discordant paired ends, mapping quality, soft-clipping, heterozygous/homozygous ratio, allele fraction, …

• How to use long-read technologies?

CAGTGA > TCTCT complex variant