mpg ngs workshop - broad institute · 2015. 10. 16. · svs typically by lane typically multiple...

MPG NGS workshop: SNP calling and error modeling February 2011 Ryan Poplin Genome Sequencing and Analysis Medical and Population Genetics


MPG NGS workshop: SNP calling and error modeling

February 2011

Ryan Poplin Genome Sequencing and Analysis Medical and Population Genetics


The paradigm today

Phase 1: NGS data processing (typically by lane)

•  Input: Raw reads
•  Steps: Mapping → Local realignment → Duplicate marking → Base quality recalibration
•  Output: Analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously but can be single sample alone)

•  Input: Sample 1 reads ... Sample N reads
•  Steps: SNPs, Indels, Structural variation (SV)
•  Output: Raw variants (Raw SNPs, Raw indels, Raw SVs)

Phase 3: Integrative analysis

•  Input: Raw variants plus external data (Pedigrees, Known variation, Known genotypes, Population structure)
•  Steps: Variant quality recalibration, Genotype refinement
•  Output: Analysis-ready variants


Step 2: SNP discovery

Pipeline: Analysis-ready BAMs → Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) → Variant Quality Recalibration → Beagle

•  We note that we no longer use any hard-filters (proximity to indel calls, clustered SNPs, etc.) at any point in the process.

•  Unified Genotyper math and command lines discussed in previous meetings. (see Appendix for full details)


Step 3: SNP discovery

Pipeline: Analysis-ready BAMs → Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) → Variant Quality Recalibration → Beagle

•  The variant quality recalibration process has gone through a major overhaul recently. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.

•  Outline:
    •  Quick Variant Recalibration overview
    •  Contrastive clustering walkthrough
    •  Ti/Tv-free quality thresholding or commitment-free probabilistic callsets


Variant annotations provide signal with which to remove artifacts

VCF record for an A/G SNP at 22:49582364:

22  49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78

INFO field   Meaning
AC           No. chromosomes carrying alt allele
AN           Total no. of chromosomes
AF           Allele frequency
DP           Depth of coverage
QD           QUAL score over depth
AB           Allele balance of ref/alt in hets
HRun         Length of longest contiguous homopolymer
MQ           RMS MAPQ of all reads
MQ0          No. of MAPQ 0 reads at locus
SB           Estimated SB score
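To make the annotations concrete, here is a minimal sketch of turning the INFO column of the record above into typed values. The function name `parse_info` is hypothetical and this is not GATK code; it only assumes the `KEY=value;` layout visible in the record.

```python
def parse_info(info):
    """Parse a VCF INFO column ('KEY=value; KEY=value') into a dict.

    Values become int or float where possible; bare keys (flags) map to True.
    """
    fields = {}
    for item in info.split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:            # flag without a value
            fields[key] = True
            continue
        try:
            fields[key] = int(value)
        except ValueError:
            try:
                fields[key] = float(value)
            except ValueError:
                fields[key] = value  # keep non-numeric values as text
    return fields


# INFO column copied from the record shown on the slide.
info = parse_info(
    "AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; "
    "MQ=71.31; MQ0=22; QD=2.29; SB=-31.76"
)
qual = 198.96
print(info["AC"], info["AN"], info["AC"] / info["AN"])  # allele count, chromosomes, frequency
print(round(qual / info["DP"], 2), info["QD"])          # QUAL over depth vs. reported QD
```

In this record AC/AN matches the reported AF and QUAL/DP matches the reported QD to two decimals, which is a quick sanity check on the field definitions in the table.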


Variant Quality Score Recalibration Model

Gaussian Mixture Model trained on annotated variants, find MAP using VBEM:

$$p(c) = \sum_{z} p(z)\, p(c \mid z) = \sum_{k=1}^{K} \pi_k\, p(\pi_k)\, \mathcal{N}(c \mid \mu_k, \Sigma_k)\, p(\mu_k, \Sigma_k)$$

•  $p(\pi_k)$: Dirichlet distribution. Prior expectation is sparse set

•  $p(\mu_k, \Sigma_k)$: Normal-inverse Wishart distribution. Prior expectation is the empirical mean and empirical covariance of the data. Bias away from singularities.
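As an illustration of the mixture term only, the sketch below evaluates log p(c) for a point under a Gaussian mixture with fixed parameters. It assumes diagonal covariances and leaves out both the priors and the VBEM fit, so it is a simplification rather than the recalibrator's model; all names and the example numbers are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption: no off-diagonal terms)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def mixture_logpdf(x, weights, means, variances):
    """log sum_k pi_k N(x | mu_k, diag(var_k)), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-cluster model over two annotations (values are made up).
weights = [0.7, 0.3]
means = [[20.0, 60.0], [5.0, 40.0]]
variances = [[16.0, 25.0], [9.0, 100.0]]
print(mixture_logpdf([18.0, 58.0], weights, means, variances))
```

A variant that sits close to a heavily weighted cluster gets a higher log density than one far from every cluster, which is the quantity the later slides turn into a LOD score.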


Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true

[Figure, panels A to E, for HiSeq, Exome, and Low-pass data. "HiSeq: training on HapMap": Gaussian mixture model fits, with likely dbSNP errors marked and axes running from more bias to less bias. "HiSeq: evaluating novel variants" and "Evaluating novel variants": number of novel variants (1000s) by analysis tranche, with Ti/Tv and FDR per tranche; legend: cumulative TPs, tranche-specific TPs, tranche-specific FPs, cumulative FPs. NR sensitivity (%) against HM3 and the 1KG Trio for heterozygous and homozygous variants, NGS only and with imputation.]


Running the Variant Quality Score Recalibrator

See http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration

•  Wiki page has full list of command lines broken out by the various steps in the process

•  Wiki page also has links to all the data sets we recommend using as training data

•  In a few weeks this whole process will be condensed into two much easier to use steps


Contrastive VQSR Clustering Walkthrough

First partition the data into a training set by looking at sites which overlap with HapMap3.3 and the Omni chip.
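The partition step can be sketched as a set-membership test on site coordinates. This is an illustration, not the recalibrator's code: the function name and the dict layout are hypothetical, and it assumes that a call overlapping either resource counts as training data (the slide does not say whether both are required).

```python
def split_training_set(variants, hapmap_sites, omni_sites):
    """Split calls into (training, remainder) by overlap with the truth resources.

    variants: list of dicts with 'chrom' and 'pos' keys (hypothetical layout).
    hapmap_sites, omni_sites: iterables of (chrom, pos) tuples.
    Assumption: overlap with EITHER resource puts a call in the training set.
    """
    truth_sites = set(hapmap_sites) | set(omni_sites)
    training, remainder = [], []
    for v in variants:
        target = training if (v["chrom"], v["pos"]) in truth_sites else remainder
        target.append(v)
    return training, remainder


calls = [
    {"chrom": "20", "pos": 100, "QD": 21.0},
    {"chrom": "20", "pos": 250, "QD": 3.1},
    {"chrom": "20", "pos": 400, "QD": 18.4},
]
training, remainder = split_training_set(calls, {("20", 100)}, {("20", 400)})
print(len(training), len(remainder))
```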


Contrastive VQSR Clustering Walkthrough

Using the Variational Bayes EM algorithm, learn a probability distribution over the training set.


Contrastive VQSR Clustering Walkthrough

Assign a probability to each variant based on how well it clusters with the training set.

Unfortunately a sizeable number of seemingly good variants fall outside of the main clusters.

Furthermore, all clusters are essentially two-sided tests but most annotations are really only one-sided.


Contrastive VQSR Clustering Walkthrough

Solution: Train a second set of clusters based on the bottom 10% of variants which had the worst LOD.

This model for the bad variants allows for contrastive evaluation. New LOD score becomes difference between the good model and the bad model.
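The contrastive step can be sketched end to end on a single annotation. To keep the example short it swaps in one 1-D Gaussian per model instead of the VBEM-fitted mixture, so it shows the shape of the procedure rather than the actual recalibrator; all function names and numbers are hypothetical.

```python
import math
import statistics


def fit_gaussian(values):
    """Fit a 1-D Gaussian (stand-in for the mixture model); returns (mean, variance)."""
    mean = statistics.mean(values)
    var = statistics.pvariance(values, mu=mean)
    return mean, max(var, 1e-6)  # floor the variance to avoid a singular fit


def logpdf(x, model):
    mean, var = model
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def contrastive_lod(training, candidates, worst_fraction=0.10):
    """LOD = log p_good(x) - log p_bad(x) for each candidate annotation value."""
    good = fit_gaussian(training)
    # Rank candidates by how poorly they fit the good model, worst first.
    ranked = sorted(candidates, key=lambda x: logpdf(x, good))
    n_bad = max(2, int(len(ranked) * worst_fraction))
    bad = fit_gaussian(ranked[:n_bad])
    return [logpdf(x, good) - logpdf(x, bad) for x in candidates]


training = [9.0, 9.5, 10.0, 10.5, 11.0]
candidates = [9.0 + 0.1 * i for i in range(20)] + [1.0, 1.2, 0.8]
lods = contrastive_lod(training, candidates)
print(lods[10] > 0, lods[20] < 0)
```

Values near the training data score well under the good model and poorly under the bad one, so their LOD is positive; the low outliers that seeded the bad model end up with a negative LOD.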


Contrastive VQSR Clustering Walkthrough

Contrastive VQSR clustering allows us to rescue the variants which fall outside of the main clusters but which also don’t fit the model for bad variants.

[Figure: histogram of Number of SNPs (0 to 6000) by AC (0 to 25).]


Sensitivity vs. specificity plots with the new Ti/Tv-less approach look good

[Figure: two panels, "1000G low-pass August N=629" and "NA12878 HiSeq WGS". x-axis: Tranche truth sensitivity (0.5 to 1.0); y-axis: Specificity (Novel Ti/Tv ratio).]
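One way to read the "tranche truth sensitivity" axis is as a cutoff chosen from the truth sites themselves: pick the LOD threshold that still keeps a target fraction of them. The sketch below illustrates that idea under this assumption; the function name is hypothetical and it is not taken from the recalibrator.

```python
import math


def tranche_threshold(truth_lods, target_sensitivity):
    """Lowest LOD cutoff that still retains `target_sensitivity` of the truth sites.

    truth_lods: LOD scores of calls that overlap the truth set.
    target_sensitivity: fraction in (0, 1], e.g. 0.99.
    """
    ranked = sorted(truth_lods, reverse=True)
    # Small epsilon guards against float round-off such as 0.7 * 10 = 7.000000000000001.
    keep = math.ceil(target_sensitivity * len(ranked) - 1e-9)
    keep = min(max(keep, 1), len(ranked))
    return ranked[keep - 1]


truth_lods = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]
cutoff = tranche_threshold(truth_lods, 0.9)
retained = sum(1 for lod in truth_lods if lod >= cutoff)
print(cutoff, retained / len(truth_lods))
```

Because the cutoff is defined by retained truth sites, it needs no Ti/Tv target; the novel Ti/Tv ratio can then be reported per tranche as an independent check, as in the plots.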


The low confidence tranches consist largely of the low frequency events (most likely FPs)

61-sample CEU from 1000G

[Figure, four panels. (1) "Alternate Allele Count in CEU Population": thousands of SNPs (known and novel) and variant rediscovery rate (HapMap v3.2 rediscovery rate, 1KG Project Pilot 1 rediscovery rate) by alternate allele count. (2) "Alternate Allele Count in CEU Population": thousands of SNPs by FDR tranche (0.10%, 1.00%, 5.00%, 10.00%) with known and novel Ti/Tv ratio on a transition to transversion ratio axis; annotation: 24,456 novel variants in aggregate with 1.83 Ti/Tv ratio. (3) NRD rate by non-reference allele frequency, pre and post imputation. (4) NRD rate by sequencing depth, pre and post imputation. Annotations: Overall NRD Rate = 24.5%; Overall NRD Rate = 6.8%.]


Broad discovered the most variants at very high quality levels in 1000G chr20 bake-off exercise

# samples  Center  Total # variants  dbSNP% (129)  # knowns  Known ti/tv  # novels  Novel ti/tv  Includes genotype refinement?
1004       Broad   765,365           24.82         190,000   2.36         575,365   2.37         No
1004       BC      733,155           25.34         185,787   2.37         547,368   2.32         No
1004       Sanger  728,374           25.31         184,341   2.36         544,033   2.36         No
1004       UMich   721,250           26.46         190,871   2.33         530,379   2.35         Yes
1004       Oxford  660,024           27.44         181,095   2.38         478,929   2.38         Yes
1004       BCM     605,274           29.98         181,444   2.33         423,830   2.29         Yes
1004       NCBI    601,907           29.26         176,150   2.39         425,757   2.57         No
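For readers unfamiliar with the ti/tv columns, the sketch below shows how a transition/transversion ratio is computed from a list of SNPs. It is a generic illustration with hypothetical names, not the evaluation code used for the bake-off.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes (A<->G, C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps):
    """Ti/Tv over an iterable of (ref, alt) single-base pairs; ref == alt entries are ignored."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2 transitions, 1 transversion
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions, so errors that hit bases uniformly at random pull the ratio toward 0.5. That is why a novel Ti/Tv close to the known Ti/Tv is read as a sign of few false positives.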


Final Thoughts

•  Our data processing pipeline produces really good SNP calls. The same pipeline is used for whole exome and WGS, both deep and low-pass sequencing. Short indel calls too!

•  Anything can be used as truth data. Validation assays, several 1000G callsets, or auto-generate your own by subsetting to the highest quality SNPs

•  There is no reason to decide between high sensitivity or high specificity. Just use a probabilistic callset.

•  The tools are available to all: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit


Appendix


Step 2a: SNP discovery

Pipeline: Analysis-ready BAMs → Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) → Variant Quality Recalibration → Beagle

•  The genotype likelihoods calculation now takes overlapping read pairs (where bases are not independent observations) into account, which we term “fragment-based calling”.


GATK single sample genotype likelihoods

Bayesian model:

$$L(G \mid D) = P(G)\, P(D \mid G) = \prod_{b \in \{\text{good\_bases}\}} P(b \mid G)$$

Here $L(G \mid D)$ is the likelihood for the genotype, $P(G)$ is the prior for the genotype, $P(D \mid G)$ is the likelihood of the data given the genotype, and the product over bases is the independent base model.

•  Priors applied during multi-sample calculation; P(G) = 1
•  Likelihood of data computed using pileup of bases and associated quality scores at given locus
•  Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
•  P(b | G) uses calibrated base quality score
•  L(G|D) computed for all 10 genotypes

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
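A minimal sketch of the independent base model follows. The slide only states that P(b | G) uses the calibrated base quality; the specific per-base form used here (each read base drawn from either chromosome with probability 1/2, an error landing on each of the three other bases with probability e/3) is a common textbook choice and an assumption, as are the function names and the `min_base_qual` default.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def base_given_allele(base, allele, error):
    """P(observed base | true allele), assuming errors are spread evenly over 3 bases."""
    return 1.0 - error if base == allele else error / 3.0


def genotype_log10_likelihoods(pileup, min_base_qual=10):
    """log10 L(G | D) for all 10 genotypes, with P(G) = 1.

    pileup: list of (base, phred_quality) tuples at one locus.
    min_base_qual: hypothetical 'good base' filter on base quality only.
    """
    result = {}
    for genotype in GENOTYPES:
        total = 0.0
        for base, qual in pileup:
            if qual < min_base_qual:
                continue  # not a "good base"
            error = 10.0 ** (-qual / 10.0)
            p = 0.5 * base_given_allele(base, genotype[0], error) \
                + 0.5 * base_given_allele(base, genotype[1], error)
            total += math.log10(p)
        result[genotype] = total
    return result


pileup = [("A", 30)] * 6 + [("G", 30)] * 6
likelihoods = genotype_log10_likelihoods(pileup)
print(max(likelihoods, key=likelihoods.get))
```

With six A and six G bases at Q30, the heterozygous A/G genotype has by far the highest likelihood, since each homozygous genotype must explain half the pileup as sequencing errors.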


Step 2b: SNP discovery

Pipeline: Analysis-ready BAMs → Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) → Variant Quality Recalibration → Beagle

•  We now use Heng Li’s Exact model to calculate P(AF>0) instead of our previous heuristic grid search model.
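To give a feel for what an exact allele-count calculation involves, the sketch below sums over all genotype configurations by polynomial convolution and then applies a prior to get P(AF>0). It is an illustration under stated assumptions, not the recursion or prior used in the GATK: it works in linear rather than log space, the prior P(AC=k) = heterozygosity/k for k ≥ 1 is assumed, and all names are hypothetical.

```python
from math import comb


def allele_count_likelihoods(sample_gls):
    """P(D | AC = k) for k = 0..2n, summing over all genotype configurations.

    sample_gls: one (g0, g1, g2) tuple per sample, the likelihood of that
    sample's data given 0, 1, or 2 copies of the alt allele.
    """
    poly = [1.0]
    for g0, g1, g2 in sample_gls:
        nxt = [0.0] * (len(poly) + 2)
        for k, coeff in enumerate(poly):
            nxt[k] += coeff * g0
            nxt[k + 1] += coeff * 2.0 * g1  # a het can sit on either chromosome
            nxt[k + 2] += coeff * g2
        poly = nxt
    n_chrom = 2 * len(sample_gls)
    # Normalise by the number of ways to place k alt alleles on 2n chromosomes.
    return [c / comb(n_chrom, k) for k, c in enumerate(poly)]


def prob_variant(sample_gls, heterozygosity=1e-3):
    """P(AF > 0 | D) under the assumed prior P(AC = k) = heterozygosity / k, k >= 1."""
    like = allele_count_likelihoods(sample_gls)
    n_chrom = len(like) - 1
    prior = [0.0] + [heterozygosity / k for k in range(1, n_chrom + 1)]
    prior[0] = 1.0 - sum(prior)
    joint = [l * p for l, p in zip(like, prior)]
    return 1.0 - joint[0] / sum(joint)


# Three samples, each with weak evidence for a het (likelihoods are made up).
weak_het = (0.02, 0.90, 0.08)
print(prob_variant([weak_het]), prob_variant([weak_het] * 3))
```

Each additional sample with weak evidence for the alt allele raises P(AF>0), which is the point of combining weak single-sample calls in a joint estimate.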


We apply a generalization of the single sample SNP caller for multi sample data

•  This approach allows us to combine weak single sample calls to discover variation among several samples with high confidence

Flow: Sample-associated reads (Individual 1, Individual 2, ..., Individual N) → Genotype likelihoods → Joint estimate across samples → Genotype frequencies and Allele frequency → SNPs

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information


Running the Unified Genotyper

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

java -Xmx2048m -jar GenomeAnalysisTK.jar -R /broad/1KG/reference/human_b37_both.fasta -T UnifiedGenotyper -B:dbsnp,VCF dbsnp_132_b37.vcf -o NA19240.raw.vcf -stand_call_conf 30 --heterozygosity 1.000000e-03 -I NA19240.SLX.bam

•  -stand_call_conf 30: Minimum phred-scaled confidence required to emit a SNP
•  --heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban
•  -I NA19240.SLX.bam: BAM file containing NA19240 SLX reads

Raw VCF calls (NA19240.raw.vcf):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 . <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 . <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 . <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00

<ATTRIBUTES>: Long string of variant annotations (more info in a few slides)
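Since the calling threshold is phred-scaled, a short sketch of the conversion may help when reading QUAL values. The helper names are hypothetical; the formula Q = -10 log10(p) is the standard phred definition.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong, given a phred-scaled confidence."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Phred-scaled confidence for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


# A confidence threshold of 30 corresponds to a 1-in-1000 chance of a wrong call.
print(phred_to_error_prob(30))

# QUAL values of the three example records; all clear the threshold of 30.
quals = [53.13, 331.37, 399.86]
print([q for q in quals if q >= 30])
```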


Variants with bad Haplotype Scores often exhibit good Ti/Tv ratios and are included in other centers’ callsets, but are likely FPs

Bad sites being called by other centers but correctly filtered by the Broad.

These sites are potentially bad in other SNP annotation dimensions.

Higher score means more evidence for error.