Post on 17-May-2020


Experience of using BWA-MEM & GATK HaplotypeCaller for Variant Calling in Multiple Rat Strains

Wim Spee, Bio-informatics Engineer, Cuppen Group, Hubrecht Institute

Content

● Project overview: Euratrans (FP7)
● Pipeline and data overview
● NGS alignment and variant calling
● BAC alignment and variant calling
● NGS and BAC genotype concordance
● Heterozygosity in inbred species
● Conclusion & discussion

Euratrans (FP7)

● European consortium for large-scale functional genomics in the rat for translational research

● Rat is a popular model organism
● Multiple homozygous inbred disease model strains have been set up
  – Example: SHR = spontaneously hypertensive rat
  – Set up by traditional breeding on phenotype, before NGS was established

Pipeline and Data Overview

Raw reads = WGS SOLiD fragment (50 bp) and PE (50 bp x 35 bp)
Mapping = BWA 0.5.9 colorspace
Duplicate marking = Picard MarkDuplicates
Local realignment = GATK IndelRealigner
BQSR = N/A on BWA-mapped SOLiD
Call variants = GATK HaplotypeCaller multi-sample
VQSR = SNP array and top 33% of indels as truth variant call set
Variant evaluation = Precision and recall against 13 aligned Sanger-based contigs (2.1 Mb, aligned with BWA-MEM)

NGS Alignment and Variant Calling

● “Best practice” BWA-Picard-GATK pipeline
  – GATK HaplotypeCaller for variant calling
  – Variant Quality Score Recalibration (VQSR) using SNP array and top 33% INDELs

HaplotypeCaller (HC) (Theory)

● Local de novo assembly based variant caller
  – Calls SNP, INDEL, MNP and small SV simultaneously
  – Removes mapping artifacts
  – More sensitive and accurate than the Unified Genotyper (UG)
  – Physical phasing of variants
  – Used to run on geological timescales
  – Now runs on practical timescales (v2.6.3 via Queue on an SGE cluster)
    ● 2 days for multi-sample variant calling on 10 SOLiD WGS rat strains

Slide taken from Broad presentation


Variant Quality Score Recalibration (VQSR)

● Use known true variants to dynamically set a cutoff between true positive and false positive calls
  – True positives will cluster together with the known variants and false positives will mainly be in a separate cluster
  – Alternative to setting manual hard cutoffs (e.g. coverage = 20, quality = 50)
● Known true variants for the rat species:
  – SNP: 500,000 high quality positions from a SNP array
  – INDEL: no external set available, used top 33% (by QUAL) in call set
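The INDEL truth-set selection described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name, the (id, QUAL) tuple layout and the rounding rule (round up) are assumptions.

```python
import math
from typing import List, Tuple

Call = Tuple[str, float]  # (hypothetical variant id, QUAL)


def top_fraction_by_qual(calls: List[Call], fraction: float = 0.33) -> List[Call]:
    """Return the highest-QUAL calls, keeping `fraction` of the call set."""
    if not calls:
        return []
    # Assumption: round the number of kept calls up to a whole call.
    n_keep = math.ceil(len(calls) * fraction)
    ranked = sorted(calls, key=lambda call: call[1], reverse=True)
    return ranked[:n_keep]


if __name__ == "__main__":
    # Made-up QUAL values, for illustration only.
    indels = [("a", 10.0), ("b", 500.0), ("c", 30.0),
              ("d", 900.0), ("e", 45.0), ("f", 60.0)]
    print(top_fraction_by_qual(indels))
```

Because the truth set is drawn from the call set itself, it inherits any bias in QUAL, which is one reason the INDEL results later in the talk are marked preliminary.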

VQSR (SNP): Plots

VQSR (SNP): Truth Sensitivity Tranches

VQSR (SNP): Truth Sensitivity Tranches

VQSR (INDEL): Plots

VQSR (INDEL): Truth Sensitivity Tranches?

Which Tranche to Take?
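As a rough aid for reading the tranche plots: a truth sensitivity tranche of, say, 99.5 corresponds to the recalibrated score (VQSLOD) cutoff at which 99.5% of the truth variants are still retained. The sketch below shows only that idea; the function name and the input (a plain list of VQSLOD scores at truth sites) are assumptions, and it is not GATK's implementation.

```python
import math
from typing import List


def tranche_cutoff(truth_vqslod: List[float], target_sensitivity: float) -> float:
    """Lowest VQSLOD that still keeps `target_sensitivity` of the truth variants."""
    if not truth_vqslod:
        raise ValueError("need at least one truth variant score")
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    ordered = sorted(truth_vqslod, reverse=True)
    # Number of truth variants that must pass the cutoff.
    n_keep = math.ceil(len(ordered) * target_sensitivity)
    return ordered[n_keep - 1]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    scores = [6.1, 5.4, 4.9, 3.3, 2.0, 1.2, 0.4, -0.7, -3.5, -9.8]
    print(tranche_cutoff(scores, 0.5))
```

A higher tranche keeps more truth variants but also lets through more calls from the low-scoring (likely false positive) cluster, which is the trade-off behind the question on the slide.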

BAC Contig Alignment and Variant Calling

● 13 BACs for rat strain LE, ca. 150 kb per BAC, 2.1 Mb in total
● BAC contig alignment
  – BWA-MEM
● BAC contig variant calling
  – GATK Unified Genotyper
● BAC & NGS alignment and variant calls in IGV

BAC Contig Alignment: BWA-MEM

● BWA-MEM*
  – New long read & contig aligner from Heng Li
    ● 70 bp to a few Mb
  – Can switch between end-to-end and local alignment
    ● Supports structural events detection from long reads and contigs
  – Outputs a standard BAM file
    ● Useful for downstream processing

* Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (arXiv:1303.3997v2)

BWA-MEM Settings

● Seed length
  – 400 bp
● Banded alignment (space to search for optimal alignment)
  – 5000 positions
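For reference, a sketch of how these two settings could be passed to `bwa mem`. It assumes the slide's "seed length" maps to `-k` (minimum seed length) and "banded alignment" to `-w` (band width); the slide does not give the actual command line, and the file names are hypothetical.

```python
import shlex
from typing import List


def bwa_mem_command(reference: str, contigs: str,
                    seed_length: int = 400, band_width: int = 5000) -> List[str]:
    """Build a `bwa mem` argument list for aligning BAC contigs."""
    return [
        "bwa", "mem",
        "-k", str(seed_length),  # minimum seed length (assumed mapping)
        "-w", str(band_width),   # band width for banded alignment (assumed mapping)
        reference,
        contigs,
    ]


if __name__ == "__main__":
    # Hypothetical file names; bwa writes SAM to stdout.
    print(shlex.join(bwa_mem_command("rn5.fa", "bac_contigs.fa")))
```

Both values are far above the `bwa mem` defaults (19 and 100), which fits contigs of roughly 150 kb rather than short reads.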

GATK UG Settings for Variant Calling on Aligned BAC

● --genotype_likelihoods_model BOTH
● -stand_call_conf 0
● -stand_emit_conf 0
● -indelGapContinuationPenalty 30
● -indelGapOpenPenalty 60
● -minIndelCnt 1
● -L BACToBedMerged.bed
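Put together, the settings above might be assembled into a GATK 2.x-style invocation as sketched below. The flags and their values are taken verbatim from the slide; the jar path, the reference, BAM and output file names, and the `-T/-R/-I/-o` scaffolding around them are assumptions about how the tool was run.

```python
from typing import List

# Settings exactly as listed on the slide.
UG_SETTINGS = [
    "--genotype_likelihoods_model", "BOTH",
    "-stand_call_conf", "0",
    "-stand_emit_conf", "0",
    "-indelGapContinuationPenalty", "30",
    "-indelGapOpenPenalty", "60",
    "-minIndelCnt", "1",
    "-L", "BACToBedMerged.bed",
]


def unified_genotyper_command(reference: str, bam: str, out_vcf: str,
                              gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build a UnifiedGenotyper argument list (hypothetical file names)."""
    base = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper",
            "-R", reference, "-I", bam, "-o", out_vcf]
    return base + UG_SETTINGS


if __name__ == "__main__":
    print(" ".join(unified_genotyper_command("rn5.fa", "bac.bam", "bac.vcf")))
```

Setting both confidence thresholds to 0 and the minimum indel count to 1 makes sense for this input: each BAC contig contributes a single "read" per position, so every difference from the reference should be emitted.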

BAC & NGS Alignment and Calls

BAC & NGS Alignment and Calls(zoomed out)

BAC Multiple Local Alignment

BAC Deletion vs. Reference

Unknown Reference Sequence

Mismatch Between BAC and Reference

SOLID Low Mapping Quality Regions

NGS and BAC Genotype Concordance

● Genotype concordance:
  – GATK module to compare 2 VCF files
● Input:
  – NGS call set restricted to BAC region
  – BAC call set
● Filters used:
  – VQSR on (NGS)
  – SNP cluster (3 SNPs in a 10 bp window) (NGS and BAC)
  – No known repeat regions (NGS and BAC)
  – No NGS LE low quality mapping regions (NGS and BAC)
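The SNP cluster filter can be illustrated with a small sketch. It assumes positions on a single chromosome and treats "3 SNPs in a 10 bp window" as three SNPs whose first and last positions are less than 10 bp apart; the exact window definition used by the GATK filter may differ.

```python
from typing import Iterable, Set


def snp_cluster_positions(positions: Iterable[int],
                          cluster_size: int = 3, window: int = 10) -> Set[int]:
    """Return positions belonging to a run of `cluster_size` SNPs within `window` bp."""
    ordered = sorted(positions)
    flagged: Set[int] = set()
    for i in range(len(ordered) - cluster_size + 1):
        j = i + cluster_size - 1
        # Assumed window rule: first and last SNP less than `window` bp apart.
        if ordered[j] - ordered[i] < window:
            flagged.update(ordered[i:j + 1])
    return flagged


if __name__ == "__main__":
    # Made-up positions, for illustration only.
    print(sorted(snp_cluster_positions([100, 104, 109, 200, 250, 255])))
```

Flagged positions would then be dropped from both the NGS and the BAC call set before the concordance comparison.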

Precision and Recall (SNP)

Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.5, BWA-MEM

No clusters  No repeats  No low quality  Match  NGS only  BAC only  Precision  Recall
-            -           -               2309   14        1747      99.40%     56.93%
x            -           -               2270   22        1157      99.04%     66.24%
x            x           -               2231   18        811       99.20%     73.34%
x            x           x               1944   10        192       99.49%     91.01%

(x = filter applied)
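The percentages in the concordance tables are consistent with precision = Match / (Match + NGS only) and recall = Match / (Match + BAC only), with the BAC calls taken as truth. A short check against the SNP counts above (the function name is illustrative):

```python
from typing import Tuple


def precision_recall(match: int, ngs_only: int, bac_only: int) -> Tuple[float, float]:
    """Precision and recall of the NGS call set, using BAC calls as truth."""
    precision = match / (match + ngs_only)  # NGS calls confirmed by BAC
    recall = match / (match + bac_only)     # BAC calls recovered by NGS
    return precision, recall


if __name__ == "__main__":
    # (Match, NGS only, BAC only) per filter level, copied from the SNP table.
    snp_rows = [(2309, 14, 1747), (2270, 22, 1157), (2231, 18, 811), (1944, 10, 192)]
    for row in snp_rows:
        p, r = precision_recall(*row)
        print(f"{row}: precision {p:.2%}, recall {r:.2%}")
```

The gain in recall comes almost entirely from the shrinking BAC-only column: each filter removes regions where the SOLiD data cannot be expected to produce a call.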

Comparison (SNP):

– Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC: precision 97.30%, recall 82.80%
– Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown: precision 99.62%, recall 91.90%

Precision and Recall (INDEL) Preliminary!

Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.3, BWA-MEM

No repeats  No low quality  GT mismatch  Match  NGS only  BAC only  Precision  Recall
-           -               12           329    109       795       75.11%     29.27%
x           -               7            287    64        469       81.77%     37.96%
x           x               1            182    28        100       86.67%     64.54%

(x = filter applied)

Comparison (INDEL):

– Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC: precision 97.80%, recall 58.60%
– Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown: precision 96.25%, recall 89.02%
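The four outcome classes in the concordance tables (Match, GT mismatch, NGS only, BAC only) can be derived from two call sets as sketched below. This is an illustration of the bookkeeping, not the GATK genotype concordance module; the site key (chrom, pos, ref, alt) and the genotype strings are assumptions.

```python
from collections import Counter
from typing import Dict, Tuple

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), assumed key


def classify_calls(ngs: Dict[Site, str], bac: Dict[Site, str]) -> Counter:
    """Count concordance classes between NGS and BAC genotype calls."""
    counts: Counter = Counter()
    for site in ngs.keys() | bac.keys():
        if site not in bac:
            counts["ngs_only"] += 1
        elif site not in ngs:
            counts["bac_only"] += 1
        elif ngs[site] == bac[site]:
            counts["match"] += 1
        else:
            counts["gt_mismatch"] += 1  # same variant, different genotype
    return counts


if __name__ == "__main__":
    # Made-up calls, for illustration only.
    ngs_calls = {("chr1", 100, "A", "G"): "1/1", ("chr1", 200, "C", "T"): "0/1",
                 ("chr1", 300, "G", "GA"): "1/1"}
    bac_calls = {("chr1", 100, "A", "G"): "1/1", ("chr1", 200, "C", "T"): "1/1",
                 ("chr1", 400, "T", "C"): "1/1"}
    print(classify_calls(ngs_calls, bac_calls))
```

In the INDEL table the reported precision and recall match the Match, NGS only and BAC only counts alone, so GT mismatches appear to be left out of both figures.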

Precision and Recall (INDEL) Improvement

● Include 2 Illumina-sequenced strains in HC variant calling
● And/or include INDEL calls based on 2 Illumina-sequenced strains in VQSR
  – Intersection between SOLiD and Illumina INDEL calls as truth set?
● Better selection of truth INDELs from current call set?

Heterozygosity

● ~10% of LE true positive calls (vs. BAC) are heterozygous
  – Remaining heterozygosity?
  – Paralogous regions?
  – Other mapping artifacts?
  – Bias of GATK HC towards diploid heterozygous species?
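A heterozygosity fraction like the ~10% above could be computed from VCF-style genotype strings as in this sketch. The genotype encoding ("0/1", "1|1", "./." for no-calls) follows the usual VCF convention; the helper names and the example data are assumptions, not the project's code.

```python
import re
from typing import Iterable


def is_heterozygous(genotype: str) -> bool:
    """True if a diploid VCF genotype such as '0/1' or '1|2' has two different alleles."""
    alleles = re.split(r"[/|]", genotype)
    return len(set(alleles)) > 1


def het_fraction(genotypes: Iterable[str]) -> float:
    """Fraction of heterozygous genotypes among called (non-missing) genotypes."""
    called = [gt for gt in genotypes if "." not in gt]
    if not called:
        return 0.0
    return sum(is_heterozygous(gt) for gt in called) / len(called)


if __name__ == "__main__":
    # Made-up genotypes, for illustration only.
    print(het_fraction(["1/1", "1/1", "0/1", "1|1", "./."]))
```

Breaking the same fraction down by region class (repeats, low mapping quality, paralogous segments) would be one way to separate residual heterozygosity from mapping artifacts.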

Conclusions & Discussion

● External data sets are very useful (SNP array & BAC for VQSR and genotype concordance)
● GATK HaplotypeCaller works
  – Better than samtools based variant calling
  – To really compare with GATK UG, run on Illumina LE strain sample
    ● SOLiD reads too short to benefit from GATK HC?
  – How to improve INDEL VQSR with no external truth set?
  – How to handle heterozygous calls in inbred species?
● BWA-MEM works
  – GATK UG can call SNP / INDELs on aligned BACs
  – Visualization in IGV
  – How to call SVs?

Acknowledgment

Cuppen Group at the Hubrecht Institute