
Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Justin Kennedy
Dissertation Defense for the Degree of Doctorate in Philosophy

Computer Science & Engineering Department
University of Connecticut

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion

2

Introduction-Single Nucleotide Polymorphisms

Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)

High density in the human genome: 1.3x10^7 out of 3x10^9 base pairs

Vast majority bi-allelic: 0/1 encoding (major/minor resp.)

Haplotype: description of SNP alleles on a chromosome
0/1 vector: 0 for major allele, 1 for minor

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcGgtatacacgggTctata …

3

Genotype Error Detection-SNP Genotypes

Diploid: two haplotypes for each chromosome, one inherited from the mother and one from the father

Multilocus Genotype: description of alleles on both chromosomes

0/1/2 vector: 0 (2) - both chromosomes contain the major (minor) allele; 1 - the chromosomes contain different alleles

SNP Genotypes are critical to Disease-Gene Mapping

  011000110
+ 001100010   two haplotypes per individual
= 012100120   genotype
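The 0/1/2 encoding above can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` is a hypothetical helper introduced here for illustration, not part of any software named in these slides; it assumes both haplotypes are 0/1 vectors of equal length.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 multilocus genotype.

    Each genotype entry counts the minor alleles carried at that locus:
    0 = homozygous major, 1 = heterozygous, 2 = homozygous minor.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP loci")
    return [a + b for a, b in zip(h1, h2)]


# The two haplotypes shown on the slide.
h1 = [int(c) for c in "011000110"]
h2 = [int(c) for c in "001100010"]
genotype = "".join(str(x) for x in genotype_from_haplotypes(h1, h2))
```

Note that the mapping is many-to-one: a heterozygous locus does not reveal which chromosome carries the minor allele, which is why phasing is needed to recover haplotypes.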

4

Introduction- Why SNP Genotypes?

SNPs are the genetic marker of choice for genome wide association studies (GWASs)

GWAS: Method for discovering disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association.

Ongoing GWASs generate a deluge of genotype data
Genetic Association Information Network (GAIN): 6 studies totaling 19,000 individuals typed at 500,000 to 940,000 SNP loci
Wellcome Trust Case-Control Consortium (WTCCC): 7 studies totaling 17,000 individuals typed at 500,000 SNPs
WTCCC2: hundreds of thousands of individuals covering over a million SNPs!

Introduction-Computational Challenges to Disease Gene Mapping

Genotype error detection: Genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes.

Local ancestry Inference: Accurate estimates of local ancestry surrounding disease-associated loci are a critical step in admixture mapping.

Accurate SNP Genotyping from new sequencing technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to the random nature of shotgun sequencing.

6

Outline

Introduction





Conclusion 7

Haplotype structure in panmictic populations

8

HMM of haplotype diversity

Similar models proposed in [Schwartz 04, Rastas et al. 08, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]

Captures Linkage Disequilibrium (LD)

k = 4(# founders)

n = 5(# SNPs)

9

Graphical model representation

Random variables for each locus i (i=1..n)
Fi = founder haplotype at locus i; values between 1 and k
Hi = observed allele at locus i; values: 0 (major) or 1 (minor)

Model training, based on Baum-Welch algorithm, using:
Reference haplotypes from population panel (e.g. Hapmap), or
Haplotypes from phased genotype using ENT software

Given haplotype h, P(H=h|M) can be computed in O(nk^2) using a forward algorithm.
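The O(nk^2) forward computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the GEDI implementation: the parameter layout (`init`, `trans`, `emit`) and the toy model values are hypothetical, chosen only so the example is self-contained.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm in O(n k^2) time.

    init[f]        : P(F_1 = f)
    trans[i][f][g] : P(F_{i+2} = g | F_{i+1} = f), for i = 0..n-2 (0-based list)
    emit[i][f][a]  : P(H_{i+1} = a | F_{i+1} = f), a in {0, 1}
    """
    k = len(init)
    fwd = [init[f] * emit[0][f][h[0]] for f in range(k)]
    for i in range(1, len(h)):
        # Each of the k new states sums over k predecessors: k^2 work per locus.
        fwd = [
            emit[i][g][h[i]] * sum(fwd[f] * trans[i - 1][f][g] for f in range(k))
            for g in range(k)
        ]
    return sum(fwd)


# Toy model (assumed values): k = 2 founders, n = 3 SNPs.
INIT = [0.5, 0.5]
TRANS = [[[0.9, 0.1], [0.2, 0.8]]] * 2
EMIT = [[[0.8, 0.2], [0.3, 0.7]]] * 3

# Sanity check: the probabilities of all 2^n haplotypes sum to 1.
total = sum(haplotype_probability(h, INIT, TRANS, EMIT)
            for h in product((0, 1), repeat=3))
```

Only the forward vector for the previous locus is kept, so memory use is O(k) in addition to the model itself.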

[Figure: HMM graphical model; founder states F1, F2, …, Fn emit observed alleles H1, H2, …, Hn]

Factorial HMM for genotype data

[Figure: factorial HMM; two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emit alleles H1, …, Hn and H'1, …, H'n, which combine into genotypes G1, G2, …, Gn]

Random variable for each locus i (i=1..n)
Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)
Given multilocus genotype g, P(g|M) can be computed in O(nk^4) using a forward algorithm.

HMM Based Genotype Imputation

Probability of observing genotype x at locus i given the known multilocus genotype with missing data at i:

$$P(G_i = x \mid [G_1,\ldots,G_{i-1},G_{i+1},\ldots,G_n], M) \propto P([G_1,\ldots,G_{i-1},x,G_{i+1},\ldots,G_n] \mid M)$$

Gi is imputed as:

$$\arg\max_{x \in \{0,1,2\}} P([G_1,\ldots,G_{i-1},x,G_{i+1},\ldots,G_n] \mid M)$$

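Forward-backward computation

[Figure: factorial HMM slice at locus i with founder states Fi and F'i, alleles Hi and H'i, and genotype Gi]

$$P(g \mid M) = \sum_{F_i=1}^{k} \sum_{F'_i=1}^{k} f^{i}_{F_i,F'_i}\, E^{i}_{F_i,F'_i}(G_i)\, b^{i}_{F_i,F'_i}$$

where f and b denote the forward and backward probabilities at locus i for the founder pair (F_i, F'_i), and E denotes the probability that this pair emits genotype G_i.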


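Runtime

Direct recurrences for computing forward probabilities, O(nk^4), where f denotes the forward probability and E the genotype emission probability of a founder pair:

$$f^{1}_{F_1,F'_1} = P(F_1)\,P(F'_1)$$

$$f^{i}_{F_i,F'_i} = \sum_{F_{i-1}=1}^{k} \sum_{F'_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(G_{i-1})\, P(F_i \mid F_{i-1})\, P(F'_i \mid F'_{i-1})$$

Runtime reduced to O(nk^3) by reusing common terms:

$$f^{i}_{F_i,F'_i} = \sum_{F_{i-1}=1}^{k} P(F_i \mid F_{i-1})\, x^{i}_{F_{i-1},F'_i}$$

where

$$x^{i}_{F_{i-1},F'_i} = \sum_{F'_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(G_{i-1})\, P(F'_i \mid F'_{i-1})$$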
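The reuse of common terms can be sketched by computing the genotype likelihood both ways and checking that they agree. This is an illustrative sketch, not the GEDI code: the parameter layout, the function names, and the toy model values are assumptions, and both chromosomes are assumed to share the same transition and emission tables.

```python
from itertools import product


def emission(emit_i, f, fp, g):
    """P(G_i = g | F_i = f, F'_i = fp): sum over allele pairs (a, g - a)."""
    return sum(emit_i[f][a] * emit_i[fp][g - a] for a in (0, 1) if g - a in (0, 1))


def _finish(fwd, emit_n, g_n, k):
    """Apply the emission at the last locus and sum over founder pairs."""
    return sum(fwd[a][b] * emission(emit_n, a, b, g_n)
               for a, b in product(range(k), repeat=2))


def likelihood_direct(g, init, trans, emit):
    """O(n k^4): every founder pair sums over every previous founder pair."""
    k = len(init)
    fwd = [[init[a] * init[b] for b in range(k)] for a in range(k)]
    for i in range(1, len(g)):
        t, e = trans[i - 1], emit[i - 1]
        fwd = [[sum(fwd[pa][pb] * emission(e, pa, pb, g[i - 1]) * t[pa][a] * t[pb][b]
                    for pa, pb in product(range(k), repeat=2))
                for b in range(k)] for a in range(k)]
    return _finish(fwd, emit[-1], g[-1], k)


def likelihood_reduced(g, init, trans, emit):
    """O(n k^3): sum out F'_{i-1} first (table x), then F_{i-1}."""
    k = len(init)
    fwd = [[init[a] * init[b] for b in range(k)] for a in range(k)]
    for i in range(1, len(g)):
        t, e = trans[i - 1], emit[i - 1]
        x = [[sum(fwd[pa][pb] * emission(e, pa, pb, g[i - 1]) * t[pb][b]
                  for pb in range(k))
              for b in range(k)] for pa in range(k)]
        fwd = [[sum(t[pa][a] * x[pa][b] for pa in range(k))
                for b in range(k)] for a in range(k)]
    return _finish(fwd, emit[-1], g[-1], k)


# Toy model (assumed values): k = 2 founders, n = 3 SNPs.
INIT = [0.5, 0.5]
TRANS = [[[0.9, 0.1], [0.2, 0.8]]] * 2
EMIT = [[[0.8, 0.2], [0.3, 0.7]]] * 3

direct = likelihood_direct([0, 1, 2], INIT, TRANS, EMIT)
reduced = likelihood_reduced([0, 1, 2], INIT, TRANS, EMIT)
total = sum(likelihood_reduced(list(g), INIT, TRANS, EMIT)
            for g in product(range(3), repeat=3))
```

The intermediate table `x` has k^2 entries and each costs k operations, as does each entry of the new forward table, which gives k^3 work per locus instead of k^4.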

Speed-up: PopTree Trie

18

Outline

Introduction


Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Motivation Likelihood Sensitivity Approach to Error Detection

Hidden Markov Model of Haplotype Diversity

Efficiently Computable Likelihood functions

Experimental Results



Conclusion19

Genotype Errors- Motivation

A real problem despite advances in genotyping technology
  [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
  1% errors decrease power by 10-50% for linkage, and by 5-20% for association

Error types
  Easily detectable errors
    Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004]
    For pedigree data some errors detected as Mendelian Inconsistencies (MIs)
    E.g. only ~30% detectable as MIs for trios [Gordon et al. 1999]
  Undetected errors

Methods for handling undetected errors:
  Improved genotype calling algorithms
  Improved modeling in analysis methods
  Separate error detection step

Detected errors can be retyped, imputed, or ignored in downstream analysis

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

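Original trio T (in this example 0 and 1 denote homozygous genotypes and 2 denotes a heterozygous genotype):

Mother  0 1 2 1 0 2   phased as H1 = 0 1 1 1 0 0 and H2 = 0 1 0 1 0 1
Father  0 2 2 1 0 2   phased as H3 = 0 0 0 1 0 1 and H4 = 0 1 1 1 0 0
Child   0 2 2 1 0 2   inherits H1 = 0 1 1 1 0 0 and H3 = 0 0 0 1 0 1

Likelihood of best phasing for original trio T:

L(T) = MAX p(H1) p(H2) p(H3) p(H4)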


Modified trio T' (the child genotype marked ? is the one under test; the phasing shown sets it to 0):

Mother  0 1 2 1 0 2   phased as H'1 = 0 1 0 1 0 1 and H'2 = 0 1 1 1 0 0
Father  0 2 2 1 0 2   phased as H'3 = 0 0 0 1 0 0 and H'4 = 0 1 1 1 0 1
Child   0 2 ? 1 0 2   inherits H'1 = 0 1 0 1 0 1 and H'3 = 0 0 0 1 0 0

Likelihood of best phasing for modified trio T':

L(T) = MAX p(H1) p(H2) p(H3) p(H4)

L(T') = MAX p(H'1) p(H'2) p(H'3) p(H'4)

Large change in likelihood suggests likely error
Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., R=10^4)
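The thresholding rule can be sketched as below. The function name, the dictionary layout, and the likelihood values are hypothetical and chosen only for illustration; in practice L(T) and L(T') would come from a phasing or HMM likelihood computation.

```python
def flag_errors(l_original, l_modified, threshold=1e4):
    """Return the (member, locus) keys whose likelihood ratio exceeds R.

    l_original : L(T), likelihood of the original trio
    l_modified : dict mapping (member, locus) to L(T'), the likelihood of the
                 trio with that one genotype modified
    threshold  : detection threshold R (1e4 mirrors the slide's example)
    """
    if l_original <= 0:
        raise ValueError("L(T) must be positive to form a ratio")
    return sorted(key for key, l_mod in l_modified.items()
                  if l_mod / l_original > threshold)


# Made-up likelihoods for a single trio.
L_T = 1e-12
L_T_MODIFIED = {("child", 2): 5e-8, ("mother", 0): 2e-12}
flagged = flag_errors(L_T, L_T_MODIFIED)
```

For long SNP windows the likelihoods underflow quickly, so a practical variant would compare log-likelihood differences against log R instead of forming the ratio directly.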


Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection

[Becker et al. 06] Implementation in FAMHAP Software

Window-based algorithm
For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)
Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
Flag genotype as an error if L(T')/L(T) > R for at least one window

Mother …201012 1 02210...
Father …201202 2 10211...
Child  …000120 2 21021...

24

Genotype Error Detection- Limitations of FAMHAP

Unbounded list of haplotypes (H=4n) is hard to compute

Truncating H may lead to sub-optimal phasings and inaccurate L(T) values

False positives caused by nearby errors (due to the use of multiple short windows)

Our approach:
HMM of haplotype frequencies: all haplotypes represented + no need for short windows
Alternate likelihood functions: scalable runtime

25

Trio-Based HMM of haplotype diversity

26

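[Figure: trio-based HMM; for each parent, two founder chains F1, …, Fn and F'1, …, F'n emit alleles H1, …, Hn and H'1, …, H'n, which determine the mother genotypes GM1, …, GMn, father genotypes GF1, …, GFn, and child genotypes GC1, …, GCn]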

Genotype Error Detection- Alternate Likelihood Functions

Viterbi probability (ViterbiProb): Maximum prob. of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio.

Probability of Viterbi Haplotypes (ViterbiHaps): Obtain the paths of the 4 Viterbi haplotypes, then take the product of the individual haplotype probabilities, each computed using the forward algorithm (again).

Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths.

27

Genotype Error Detection- Speed Ups from reuse of common terms

Straightforward approach run time:
For a fixed trio, ViterbiProb/TotalProb paths can be found using a 4-path version of Viterbi's/Forward algorithm in time O(nk^8)
For ViterbiHaps, additional traceback to compute probabilities: O(nk^8 + n^2 k)

K^3 speed-up by reuse of common terms: O(nk^5) per trio

Likelihoods of all 3n modified trios computed using forward-backward algorithm
ViterbiProb/TotalProb for m trios: O(mnk^5)
ViterbiHaps: O(m(nk^5 + n^2 k))

Genotype Error Detection- Comparison of Likelihood

Functions

[Figure: ROC curves, sensitivity vs. FP rate, for VitHaps-P, VitProb-P, TotalProb-P, VitHaps-C, VitProb-C, TotalProb-C]

Sensitivity = TP/(TP+FN)

False Positive rate = 1 - TN/(FP+TN)

35 SNPs, 551 trios [Becker 06], 1% err. rate
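The two evaluation measures can be written out as a short sketch; the function names and the counts in the example are made up for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true errors that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of correct genotypes that were flagged: 1 - TN / (FP + TN)."""
    return 1 - tn / (fp + tn)


# Made-up counts: 10 inserted errors, 1000 correct genotypes.
sens = sensitivity(tp=8, fn=2)
fpr = false_positive_rate(fp=5, tn=995)
```

Sweeping the detection threshold R and recording one (FP rate, sensitivity) pair per value traces out a ROC curve of the kind plotted on these slides.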

29

Genotype Error Detection-“Combined” Detection Method

Compute 4 likelihood ratios

Trio Mother-child duo Father-child duo Child (unrelated)

Flag as error if all ratios are above detection threshold

30

Genotype Error Detection- Comparison with FAMHAP

[Figure: sensitivity vs. FP rate (0 to 0.015) for TotalProb-TRIO, TotalProb-COMBINED, and FAMHAP, with #FP=#FN line; panels: Children, Parents]

Outline

Introduction



Imputation-based Local Ancestry Inference in Admixed Populations Motivation Factorial HMM of genotype data Algorithms for genotype imputation and ancestry inference Experimental results


Conclusion 32

Introduction- Population admixture

http://www.garlandscience.co.uk/textbooks/0815341857.asp?type=resources 33

Introduction- Motivation: Admixture mapping

Patterson et al, AJHG 74:979-1000, 2004 34

Introduction- Local ancestry inference problem

Given:
Reference haplotypes for all ancestral populations to be studied
Whole-genome SNP genotype data for extant individual

Find: Allele ancestries at each SNP locus

[Figure: reference haplotypes (0/1 strings with ? marking missing alleles) for each ancestral population]

SNP genotypes:
rs11095710 T T
rs11117179 C T
rs11800791 G G
rs11578310 G G
rs1187611 G G
rs11804808 C C
rs17471518 A G
...

Inferred local ancestry:
rs11095710 P1 P1
rs11117179 P1 P1
rs11800791 P1 P1
rs11578310 P1 P2
rs1187611 P1 P2
rs11804808 P1 P2
rs17471518 P1 P2
...

35

Introduction- Previous work

Two main classes of methods for SNP Ancestry Inference
HMM-based (exploit LD): SABER [Tang et al 06], SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …
Window-based (unlinked SNP Data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]

Limitations
Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)
Methods based on unlinked SNPs outperform methods that model LD!

Factorial HMM for genotype data in a window with known local ancestry

[Figure: factorial HMM M_kl; two founder chains F1, …, Fn and F'1, …, F'n emit alleles H1, …, Hn and H'1, …, H'n, which combine into genotypes G1, …, Gn]

Imputation-based ancestry inference

Fixed-window version: pick ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus

Observations:
The local ancestry of a SNP locus is typically shared with neighboring loci.
Small window sizes may not provide enough information
Large window sizes may violate local ancestry property for neighboring loci

Window size effect

[Figure: window size effect; curves labeled M11, M12, M22; parameters: N=2,000, g=7, =0.2, n=38,864, r=10^-8]

Multi-window version: weighted voting over window sizes between 200 and 3000, with window weights proportional to average posterior probabilities
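A minimal sketch of the voting scheme follows. It makes several simplifying assumptions that are not in the slides: the per-locus posteriors are precomputed once per ancestry pair (rather than recomputed by an HMM inside each window), each window votes for its best ancestry with a weight equal to that ancestry's average posterior, and all function names and toy values are hypothetical.

```python
def single_window_call(post, i, w):
    """Best ancestry at locus i for half-window w, with its average posterior.

    post[a][j] is the posterior probability of the observed genotype at
    locus j under ancestry pair a (assumed to be precomputed).
    """
    n = len(next(iter(post.values())))
    idx = range(max(0, i - w), min(n - 1, i + w) + 1)
    avg = {a: sum(p[j] for j in idx) / len(idx) for a, p in post.items()}
    best = max(avg, key=avg.get)
    return best, avg[best]


def multi_window_call(post, i, half_sizes):
    """Weighted vote over window sizes; weight = winning average posterior."""
    votes = {}
    for w in half_sizes:
        ancestry, weight = single_window_call(post, i, w)
        votes[ancestry] = votes.get(ancestry, 0.0) + weight
    return max(votes, key=votes.get)


# Toy posteriors over 10 loci: ancestry "11" fits the first half, "22" the second.
POST = {
    "11": [0.9] * 5 + [0.2] * 5,
    "22": [0.2] * 5 + [0.9] * 5,
}
call_left = multi_window_call(POST, 1, [1, 2])
call_right = multi_window_call(POST, 8, [1, 2])
```

Voting over several sizes is a way to hedge between windows that are too small to be informative and windows large enough to span an ancestry switch.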


40

Comparison with other methods

[Figure: % of correctly recovered SNP ancestries for each method; parameters: N=2,000, g=7, =0.2, n=38,864, r=10^-8]

41

N=2,000g=7

=0.5n=38,864

r=10-8

Untyped SNP imputation error rate in admixed individuals

42

Genotype Imputation- Accuracy with number of founders/runtime

5835 SNPs, 2502 unrelated (CEU) [IMAGE], 9% imputed (535 SNPs)

[Figure: imputation error rate (axis 85% to 99%) and CPU seconds (0 to 4,000) as a function of # founders (3 to 15)]

Number of founders effect on Ancestry inference

[Figure: parameters: CEU-JPT, N=2,000, g=7, =0.2, n=38,864, r=10^-8]

44

Outline

Introduction




Single Individual Genotyping from Low-Coverage Sequencing Data Motivation Single SNP Calling Algorithms HF-HMM Overview Multilocus HMM Calling Algorithm Experimental Results

Conclusion 45

Low Coverage Genotyping-Next Generation Sequencing (NGS)

By several orders of magnitude, NGS delivers higher throughput of sequencing reads compared to older technologies (e.g. Sanger sequencing)

46

Roche/454 FLX Titanium: ~1M reads, 400bp avg., 400-600Mb / run (10h)

ABI SOLiD 3 plus: ~500M reads/pairs, 35-50bp, 25-60Gb / run (3.5-14 days)

Illumina Genome Analyzer IIx: ~100-300M reads/pairs, 35-100bp, 4.5-33 Gb / run (2-10 days)

Low Coverage Genotyping-NGS Applications and Challenges

NGS is enabling many applications, including personal genomics
~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
~$1 million for sequencing James Watson genome [Wheeler et al 08] using 454 technology
~$50K human sequencing now available
Thousands more individual genomes to be sequenced as part of 1000 Genomes Project (http://www.1000genomes.org/)

Challenges:
Sequencing requires accurate determination of genetic variation (e.g. SNPs)
Accuracy is limited by coverage depth due to random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].
[Wheeler et al 08] use hypothesis testing based on binomial distribution

Low Coverage Genotyping-Do Heuristic Inputs Help?

[Wendl&Wilson 08] predict that 21x coverage is required for sequencing of samples based on the assumption that “neglects any heuristic inputs”

We propose methods incorporating two additional sources of information:

Quality scores reflecting uncertainty in sequencing data

Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap

48

Low Coverage Genotyping-Pipeline for Single Genotype Calling

49

Single SNP Genotyping- Basic Notations

Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)

SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous

Read set ri describes the mapped reads for each SNP i

[Figure: mapped reads with allele 0, mapped reads with allele 1, sequencing errors, and inferred genotypes 012100120]

Single SNP Genotype Calling

Applying Bayes' formula:

$$P(G_i \mid r_i) = \frac{P(G_i)\,P(r_i \mid G_i)}{\sum_{x \in \{0,1,2\}} P(G_i = x)\,P(r_i \mid G_i = x)}$$

Where:
P(r_i | G_i) is the conditional probability for the read set at locus i
P(G_i) are genotype frequencies at locus i inferred from a representative panel

Low Coverage Genotyping-Pipeline for Multilocus Genotyping


52

Multilocus Genotyping-HF-HMM

[Figure: HF-HMM; founder chains F1, …, Fn and F'1, …, F'n emit alleles H1, …, Hn and H'1, …, H'n, which combine into genotypes G1, …, Gn; each genotype Gi emits the reads R_{i,1}, …, R_{i,c_i}]

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

53

Multilocus Genotyping-HF-HMM Training

Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father

Use haplotype reference panel (e.g. HAPMAP) for training Haplotypes

Conditional probabilities for read sets are given by the formulas derived for the single SNP case:

54

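$$P(r_i \mid G_i = 0) = \prod_{r \in r_i:\, r(i)=0} \left(1-\varepsilon_{r(i)}\right) \prod_{r \in r_i:\, r(i)=1} \varepsilon_{r(i)}$$

$$P(r_i \mid G_i = 2) = \prod_{r \in r_i:\, r(i)=1} \left(1-\varepsilon_{r(i)}\right) \prod_{r \in r_i:\, r(i)=0} \varepsilon_{r(i)}$$

$$P(r_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$

where r(i) is the allele that read r shows at locus i, \varepsilon_{r(i)} is the error probability of that base call (from its quality score), and c_i is the number of reads in r_i.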
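The single-SNP calling rule (read-set likelihoods combined with panel genotype frequencies through Bayes' formula) can be sketched as follows. The function names and example numbers are hypothetical; `phred_to_error` uses the standard Phred convention, which is an assumption about how the quality scores are encoded.

```python
def phred_to_error(q):
    """Standard Phred convention: error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_set_likelihoods(alleles, errors):
    """Return [P(r|G=0), P(r|G=1), P(r|G=2)] for the reads covering one SNP.

    alleles[j] in {0, 1} is the allele shown by read j; errors[j] is the
    error probability of that base call.
    """
    l0 = l2 = 1.0
    for a, e in zip(alleles, errors):
        l0 *= (1 - e) if a == 0 else e   # homozygous major: allele 1 is an error
        l2 *= (1 - e) if a == 1 else e   # homozygous minor: allele 0 is an error
    l1 = 0.5 ** len(alleles)             # heterozygous: (1/2)^c
    return [l0, l1, l2]


def genotype_posteriors(alleles, errors, priors):
    """Bayes' formula with genotype frequencies `priors` for G = 0, 1, 2."""
    joint = [p * l for p, l in zip(priors, read_set_likelihoods(alleles, errors))]
    total = sum(joint)
    return [j / total for j in joint]


# Four reads, two per allele, all with Phred quality 20 (error 0.01).
errors = [phred_to_error(20)] * 4
post = genotype_posteriors([0, 1, 0, 1], errors, priors=[0.5, 0.3, 0.2])
call = post.index(max(post))
```

With only one or two reads the likelihoods barely separate the genotypes, so the call is dominated by the prior, which is the low-coverage regime where LD information from a reference panel is expected to help most.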

Multilocus Genotyping Problem

GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Quality scores
• Trained HMMs representing LD in populations of origin for mother/father

FIND:
• Multilocus genotype (G*1,G*2,…,G*n) with maximum posterior probability, i.e., (G*1,G*2,…,G*n) = argmax_{G1..Gn} P(G1…Gn | r)

Remark: max_g P((G*1,G*2,…,G*n) | r) is hard to approximate within O(n^{1-ε}) unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard

Multilocus Genotyping-HMM-Posterior Decoding Algorithm

1. For each i = 1..n, compute G*_i = argmax_{G_i} P(G_i | r, M)

2. Return (G*_1, ..., G*_n)

Forward-Backward Computation of Posterior Probabilities

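[Figure: HF-HMM slice at locus i with founder states Fi and F'i, alleles Hi and H'i, genotype Gi, and reads R_{1,1}, …, R_{1,c_1}, …, R_{i,1}, …, R_{i,c_i}, …, R_{n,1}, …, R_{n,c_n}]

$$P(G_i \mid r, M) \propto \sum_{F_i=1}^{k} \sum_{F'_i=1}^{k} f^{i}_{F_i,F'_i}\, P(r_i \mid G_i)\, E^{i}_{F_i,F'_i}(G_i)\, b^{i}_{F_i,F'_i}$$

where f and b denote the forward and backward probabilities at locus i for the founder pair (F_i, F'_i), and E denotes the probability that this pair emits genotype G_i.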



Multilocus Genotyping-Runtime

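Direct implementation gives O(m+nk^4) time:
m = number of reads
n = number of SNPs
k = number of founder haplotypes in HMMs

Runtime reduced to O(m+nk^3) by reusing common terms:

$$f^{i}_{F_i,F'_i} = \sum_{F_{i-1}=1}^{k} P(F_i \mid F_{i-1})\, x^{i}_{F_{i-1},F'_i}$$

where

$$x^{i}_{F_{i-1},F'_i} = \sum_{F'_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(r_{i-1})\, P(F'_i \mid F'_{i-1})$$

$$E^{i}_{F_i,F'_i}(r_i) = \sum_{H_i, H'_i \in \{0,1\}} P(H_i \mid F_i)\, P(H'_i \mid F'_i)\, P(r_i \mid G_i = H_i + H'_i)$$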

Low Coverage Genotyping-Experimental Results- Setup

Subset of James Watson's 454 reads
74.4M of 106.5M reads
265 bp/read
avg coverage: 5.64X

Quality scores included
Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]

Estimated mapping error rates: FP rate: 0.37%, FN rate: 21.16%

Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)

63

Accuracy Comparison (Heterozygous Genotypes)

[Figure: % correctly called (80 to 100) vs. % uncalled (0 to 100) for 1SNP-Posterior, Binomial, and HMM-Posterior]

64

Accuracy Comparison (All Genotypes)

[Figure: % correctly called (93 to 100) vs. % uncalled (0 to 100) for 1SNP-Posterior, Binomial, and HMM-Posterior]

65

Accuracy at Varying Coverages (All Genotypes)

[Figure: % correctly called (70 to 100) vs. % uncalled (0 to 100) for Posterior and Binomial at coverages 1/16, 1/8, 1/4, 1/2, and 1/1]

66

Outline

Introduction





Conclusion 67

Conclusion

68

Genotype Error Detection

Contributions:
Proposed efficient methods for error detection in trio genotype data based on an HMM of haplotype diversity
Can exploit available pedigree info
Yield improved detection accuracy compared to FAMHAP
Runtime grows linearly in #SNPs and #individuals

Papers:
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Journal of Computational Biology (to appear).
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007.

Software: GEDI (Genotype Error Detection and Imputation):
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. GEDI: Scalable Algorithms for Genotype Error Detection and Imputation. ARXIV Report, 2009.
Best poster award: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.

Conclusion

Imputation-based local ancestry inference in admixed populations

Contributions:
Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations

Future work:
Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
Extension to pedigree data
Extensions to sequencing data
Exploiting inferred local ancestry for phasing of admixed individuals
Inference of ancestral haplotypes from extant admixed populations

Papers:
B. Pasaniuc, J. Kennedy, and I.I. Mandoiu. Imputation-based local ancestry inference in admixed populations. Proc. 5th International Symposium on Bioinformatics Research and Applications/2nd Workshop on Computational Issues in Genetic Epidemiology, pp. 221-233, 2009.

Software: GEDI-ADMX, Unix and Windows version (geneAdmixViewer on Windows)

Conclusion

Genotyping from low coverage sequencing reads

Contributions:
Exploiting "heuristic inputs" such as quality scores and population allele frequency and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
LD information extracted from a reference panel gives highest benefit
Relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]
Although our evaluation is on 454 reads, the methods are well-suited for short read technologies

Future Work: Population-based Genotyping from Low-Coverage Sequencing Data
Extending the single individual genotyping methods to population sequencing data (removing the need for reference panels)
Use the same HF-HMM as before, but train the model with the EM algorithm on the population-level data provided.

Papers:
J. Duitama, S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single individual genotyping from low-coverage sequencing data (In Preparation)

Presentation: J. Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.

Software: Gene-Seq

Questions?

71

Acknowledgments

This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation

72


HAPMAP: The International HAPMAP Project is an organization

whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.

HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.

The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.

74


Baum-Welch overview

The algorithm has two steps:

Calculating the forward probability and the backward probability for each HMM state;

Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the quotient of its probability divided by the probability of the entire sequence is added to the expected count, and the normalized count can then be made the new value of the transition.

75

GEDI Speed-up: PopTree Trie

Due to limited genotype variation across individuals of the same population, additional re-use of forward and backward probability computations corresponding to genotype prefixes (respectively suffixes) shared by multiple genotypes is possible.

GEDI builds PopTree, which is a trie (prefix tree), from the given multilocus genotypes and then computes probabilities by performing a preorder traversal of the trie.

Specifically, the PopTree data structure for unrelated individuals in a population consists of:
Up to n levels,
Each node has up to 3 child edges, one for each possible genotype value (0, 1, 2).
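The prefix sharing that PopTree exploits can be sketched with a plain dictionary-based trie. This is an illustration only, not GEDI's data structure: the function names and the toy genotypes are hypothetical, and the node count stands in for the number of forward steps a preorder traversal would perform.

```python
def build_poptree(genotypes):
    """Build a trie with one level per SNP and child edges labeled 0, 1, or 2."""
    root = {}
    for g in genotypes:
        node = root
        for value in g:
            node = node.setdefault(value, {})
    return root


def count_nodes(node):
    """Number of non-root nodes, i.e. distinct genotype prefixes."""
    return sum(1 + count_nodes(child) for child in node.values())


# Four multilocus genotypes over n = 4 SNPs (toy data).
GENOTYPES = [(0, 1, 2, 1), (0, 1, 2, 0), (0, 1, 0, 0), (2, 1, 0, 0)]
tree = build_poptree(GENOTYPES)
shared_steps = count_nodes(tree)
naive_steps = sum(len(g) for g in GENOTYPES)
```

A forward vector computed at a trie node is valid for every genotype passing through it, so a preorder traversal needs one forward step per node rather than one per genotype per locus.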

76

GEDI Speed-up: PopTree Speed-up

[Figure: time (seconds, 0 to 7,000) vs. # flanking (3 to 100) for 7 Pop-Tree, 13 Pop-Tree, 7 Slow, and 13 Slow]

77

Genotype Imputation- Accuracy with varying parameters

5835 SNPs, 2502 unrelated [IMAGE], 9% imputed (535 SNPs)

[Figure: % error rate (5% to 17%) as a function of # founders (3 to 15) and # training haplotypes (30 to 520)]

78

Genotype Error Detection- Experimental Results (Setup)

Real dataset [Becker et al. 2006]
35 SNP loci covering a region of 91kb
551 trios

Synthetic datasets
35 SNPs, 551 trios
Preserved missing data pattern of real dataset
Haplotypes assigned to trios based on frequencies inferred from real dataset
1% error rate using random allele insertion model

Error Detection Accuracy on Unrelated Genotype Data

Genotype Error Detection- Effect of Distance between SNPs

[Figure: sensitivity vs. FP rate (0 to 0.015) for Len=10Kb (U), Len=100Kb (U), Len=1Mb (U), and Len=10Mb (U), with #FP=#FN line]

551 unrelated individuals
Recombination & mutation rates of 10^-8 per generation per bp
35 SNPs within a region of 10kb-10Mb

Genotype Error Detection- TrioProb-Combined Results on Real Dataset

[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
26 SNP genotypes in 23 trios were identified as true errors
41*3-26=97 resequenced SNP genotypes agree with original calls (or are unknown)

            Total Signals     True Positives    False Positives   Unknown
FP Rate     1%    .5%   .1%   1%    .5%   .1%   1%    .5%   .1%   1%    .5%   .1%
Parents     218   127   69    9     9     8     1     0     0     208   118   61
Children    104   74    24    11    11    11    3     3     2     90    60    11
Total       322   201   93    20    20    19    4     3     2     298   178   72

81

Genotype Error Detection- Error Model Comparison

[Figure: sensitivity vs. FP rate (0 to 0.015) for four error models, random allele, random geno, hetero-to-homo, and homo-to-hetero, each shown for (P) and (C)]

82

Genotype Error Detection- Distribution of LLR for Total Trio Prob.

[Figure: distribution of LLR (0 to 5.94) with counts on a log scale (1 to 1,000,000) for NO_ERR and ERR genotypes; panels: Parents-TRIOS, Children-TRIOS; annotation: same-locus errors in parents]

Genotype Error Detection-LLR for Combined Method

[Figure: distribution of LLR (0 to 5.94) with counts on a log scale (1 to 1,000,000) for NO_ERR and ERR genotypes; panels: Parents-COMBINED, Children-COMBINED]

Genotype Error Detection- Effect of Population Size

[Figure: sensitivity vs. FP rate (0 to 0.015) for n=551, n=129, and n=30, each shown for (P) and (C)]

85

Genotype Imputation-Effect of flanking size

[Figure: error rate (0% to 12%) and time (seconds, 0 to 50,000) vs. # flanking (3 to 100) for 7 founders and 13 founders]

86

Genotype Imputation-Effect of pedigree data/haplotypes

[Figure: error rate (5% to 14%) vs. # training haplotypes (30 to 520) for Trios 7-founders, Unrelated 7-founders, Unrelated 13-founders, and Trios 13-founders]

87

Introduction- Single Nucleotide Polymorphisms


Human genome density: 1.2x10^7 out of 3x10^9 base pairs
Vast majority bi-allelic: 0/1 encoding (major/minor resp.)

SNP Genotypes are critical to Disease-Gene Mapping
One Method: Admixture Mapping

  011000110
+ 001100010
= 012100120   genotype

88

Helper Slide-Software Overview

Only require information about ancestral allele frequencies:
LAMP (No recombination assumption)
WINPOP (a more refined model of recombination events coupled with an adaptive window size computation to achieve increased accuracy)
SWITCH (HMM-Based)

Only require ancestral allele frequencies & genotypes:
SABER (HMM-Based)

Additionally use ancestral haplotype information:
HAPAA (HMM-Based)
GEDI-ADMX (HMM-Based)

Helper Slide- Probabilities

$$P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i}) \sim P_{M_{kl}}(\mathbf{g}[g_i \leftarrow g])$$

G_i: Random variable genotype at SNP i

g[g_i ← g] = (g_1, ..., g_{i-1}, g, g_{i+1}, ..., g_n): Multilocus genotype with i set to g

g_{-i} = (g_1, ..., g_{i-1}, g_{i+1}, ..., g_n): Multilocus genotype without i

g ∈ {0,1,2}: Genotype variable taken at SNP i

M_kl: HMM with ancestral pair k,l

Helper Slide- Emission Details

$$E^{i}_{F^k_i,F^l_i}(g_i) = \sum_{\substack{h^k_i,\, h^l_i \in \{0,1\} \\ h^k_i + h^l_i = g_i}} P(h^k_i \mid F^k_i)\, P(h^l_i \mid F^l_i)$$

91

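Window-based local ancestry inference

Input
For every window half-size w:

$$P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i}) \quad \text{for all } i,\ g \in \{0,1,2\},\ M_{kl}$$

Output (Single Window method)
For every i=1..n:

$$\hat{a}_i = \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j}), \qquad \hat{a}_i \in A$$

Where:

$$A = \{kl \mid 1 \le k \le l \le n\}$$

$$W_i = \{\max\{1, i-w\}, \ldots, \min\{i+w, n\}\}$$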

[Figure: step-by-step animation of window-based local ancestry inference with half-window w = 10 and ancestral populations P1: YOR and P2: CEU. For each locus i = 1, 2, 3, 4, … the averages of the posteriors under M11, M12, and M22 are compared over the window W_i (W_1 = {1,..,11}, W_2 = {1,..,12}, W_3 = {1,..,13}, …, in general W_i = {i-10,..,i+10}), and the ancestries a_1, a_2, …, a_n of the two haplotypes are filled in from left to right]

Experimental Results- GEDI 1-pop Imputation

1,444 individuals trained on HAPMAP

CEU haplotype reference panel

Imputed (after masking) 1% of SNPs on chromosome 22

99

Binomial Distribution

n: # of successive trials (reads)
p: the probability of a success (correct map)
q = 1-p: the probability of a failure (incorrect map)

To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

100

Phred Score

To determine quality scores, Phred first calculates several parameters related to peak shape and peak resolution at each base. Phred then uses these parameters to look up a corresponding quality score in huge lookup tables. These lookup tables were generated from sequence traces where the correct sequence was known, and are hard coded in Phred; different lookup tables are used for different sequencing chemistries and machines.

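Conditional Probability for Heterozygous Genotypes

$P(\mathbf{r}_i \mid G_i = 1) = \prod_{r \in \mathbf{r}_i} \left[ \frac{1}{2}\left(1 - \varepsilon_{r(i)}\right) + \frac{1}{2}\,\varepsilon_{r(i)} \right] = \left(\frac{1}{2}\right)^{c_i}$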


Calculating the forward probability for each HMM state;


103

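Model Training- Details

Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise

$P(R_{i,j} = r \mid G_i = g_i) = \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{\,r(i)} \left(1 - \varepsilon_{r(i)}\right)^{1 - r(i)} + \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} \left(1 - \varepsilon_{r(i)}\right)^{r(i)}$

This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$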

Implementation Details

iii

iiii

ghhhh

iiiiii

fffhPfhPg

'

''

}1,0{,

'',

)|()|()(

Forward recurrences:

)()( '11

1, ' fPfP

ii ff

K

fi

iffii

K

fii

iff

iff

iii

iiiii

gffPffP1

11

,11

11

,,1

'11'

1

'11

' )()|()|(

Backward recurrences are similar 105

Allele coverage for heterozygous SNPs (Watson 454 @ 5.85x avg. coverage)

[Figure: two panels plotting variant allele coverage (y-axis) against reference allele coverage (x-axis)]

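Single SNP Genotyping- Incorporating Base Call Uncertainty

Let ri denote the set of mapped reads covering SNP locus i and ci=| ri |

For a read r in ri , r(i) denotes the allele observed at locus i

If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by

$\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$

The probability of observing read set ri conditional on having genotype gi is then given by:

$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$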

Experimental Results-Read Data

Subset of James Watson’s 454 reads
  74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
  Average read length ~265 bp

[Figure: read length distribution; number of reads (0 to 2,000,000) by read length (0 to 1200 bp)]



Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]

Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)

Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded

Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:

FP rate: 0.37% FN rate: 21.16%

113


Average coverage by mapped reads of Hapmap SNPs was 5.64x

Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

[Figure: SNP coverage by mapped reads; number of SNPs (0 to 600,000) at each coverage from 0 to 30]


CEU genotypes from latest Hapmap release (23a) were downloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch

Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations differed by more than 5% in CEU same-strand allele frequency



Accuracy Comparison (Homozygous Genotypes)

[Figure: % correctly called (97 to 100) versus % uncalled (0 to 100) for 1SNP-Posterior, Binomial0.01, and HMM-Posterior]

Gene-Seq Algorithms Posterior Decoding (see presentation) Greedy:

Markov Approximation:

Composite 2-SNP Viterbi Posterior Decoding, version 2

117

Introduction Helper

Linkage Analysis: Study aimed at establishing linkage between genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.

Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.

Relative Risk is the risk of an event (or of developing a disease) relative to exposure.

Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.

For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.

Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits
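The relative risk arithmetic from the smoking example on this slide can be checked with a short Python sketch. The function name `relative_risk` is chosen here for illustration only.

```python
def relative_risk(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of the event probability in the exposed group to the non-exposed group."""
    if p_unexposed <= 0:
        raise ValueError("the non-exposed probability must be positive")
    return p_exposed / p_unexposed


# Slide example: 20% of smokers vs. 1% of non-smokers develop lung cancer.
rr = relative_risk(0.20, 0.01)  # approximately 20
```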



Introduction-Disease Gene Mapping

Linkage analysis
  Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)
  Low power to detect genes with small relative risk in complex diseases [Risch & Merikangas ’96]

Association analysis
  Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies

[Figure: cases and controls]

Genotype Error Detection-Motivation

Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)

Errors as low as 0.1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]

1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

121

Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao et al. 07]

Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
  Computationally complex

Separate error detection step
  Detected errors can be retyped, imputed, or ignored in downstream analyses
  Common approach in pedigree genotype data analysis [Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]


122

Complexity of Computing Maximum Phasing Probability

• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ unless ZPP=NP, where f is the number of founders
• For trios, hard to approximate within $O(f^{1/4-\varepsilon})$
• Reductions from the clique problem

123

NGS Applications

Besides reducing costs of de novo genome sequencing, NGS has found many more apps:

Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …

NGS is enabling personal genomics
  James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
  Thousands more individual genomes to be sequenced as part of 1000 Genomes Project


Challenges in Medical Applications of Sequencing

Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)

Requires accurate determination of both alleles at variable loci




To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

125

Prior Methods for Calling SNP Genotypes from Read Data

Prior methods are all based on allele coverage
  [Levy et al 07] require that each allele be covered by at least 2 reads in order to be called
  [Wheeler et al 08] use hypothesis testing based on the binomial distribution
    To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
  [Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k
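The coverage-based heterozygous call described above can be sketched as follows. The slide does not spell out the exact statistic used in [Wheeler et al 08], so this sketch assumes the test is the binomial probability of the observed variant-allele count with success probability 0.5; the function names and the `threshold` and `p` parameters are illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def call_heterozygous(ref_reads: int, var_reads: int, threshold: float = 0.01) -> bool:
    """Return True if the allele counts pass the coverage-based heterozygous test.

    Assumption: both alleles need at least one read, and the binomial
    probability of the observed split (p = 0.5) must be at least `threshold`.
    """
    if ref_reads < 1 or var_reads < 1:
        return False
    n = ref_reads + var_reads
    return binomial_pmf(var_reads, n) >= threshold


balanced = call_heterozygous(3, 3)   # 20/64 = 0.3125, passes
skewed = call_heterozygous(10, 1)    # 11/2048 is below 0.01, fails
```

At low coverage many true heterozygous SNPs have one allele with zero reads, which is why purely coverage-based rules miss them.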

126

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads
S. Dinakar (1), Y. Hernandez (2), J. Kennedy (1), I. Mandoiu (1), and Y. Wu (1)
(1) CSE Department, University of Connecticut; (2) Department of Computer Science, Hunter College

Motivation
• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)
– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing
– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples
– Based on idealized theory that “neglects heuristic inputs”
• We propose genotype calling methods that exploit the following heuristic inputs
– Quality scores reflecting uncertainty in sequencing data
– Allele frequency and linkage disequilibrium info extracted from reference panels such as Hapmap
• Do heuristic inputs help?
– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads

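Single SNP Genotype Calling
• Let ri denote the arbitrarily ordered set of mapped reads covering SNP locus i and ci=| ri |
– For a read r in ri , r(i) denotes the allele observed at locus i
– If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$

• The probability of observing read set ri conditional on having genotype Gi:

$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$

• Applying Bayes’ formula:

$P(G_i = g_i \mid \mathbf{r}_i) = \frac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$

– Where $P(G_i = g_i)$ are allele frequencies inferred from a representative panel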
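The single-SNP calculation above (phred score to error probability, read-set likelihoods, then Bayes' formula) can be sketched in Python. This is an illustrative sketch, not the authors' code: the function names, the `(allele, phred)` read representation, and the uniform prior in the example are assumptions; a real run would take the prior from a reference panel.

```python
from typing import Dict, List, Tuple

Read = Tuple[int, float]  # (observed allele 0/1 at the locus, phred quality score)


def phred_to_error(q: float) -> float:
    """Error probability implied by a phred score: 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def read_set_likelihoods(reads: List[Read]) -> Dict[int, float]:
    """P(r_i | G_i = g) for g in {0, 1, 2}, with g counting minor alleles."""
    lik = {0: 1.0, 1: 1.0, 2: 1.0}
    for allele, q in reads:
        eps = phred_to_error(q)
        lik[0] *= (1 - eps) if allele == 0 else eps  # homozygous major
        lik[2] *= (1 - eps) if allele == 1 else eps  # homozygous minor
        lik[1] *= 0.5                                # heterozygous: (1/2)^{c_i}
    return lik


def genotype_posterior(reads: List[Read], prior: Dict[int, float]) -> Dict[int, float]:
    """Bayes' formula; assumes at least one genotype has nonzero joint probability."""
    lik = read_set_likelihoods(reads)
    joint = {g: prior[g] * lik[g] for g in (0, 1, 2)}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # hypothetical prior for the example
posterior = genotype_posterior([(0, 20), (1, 20)], uniform)
best_call = max(posterior, key=posterior.get)
```

With one read for each allele at phred 20, the heterozygous genotype dominates the posterior even though a coverage rule requiring two reads per allele would refuse to call it.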

Multilocus Model

[Figure: graphical model with founder states F1..Fn and F'1..F'n, haplotype alleles H1..Hn and H'1..H'n, genotypes G1..Gn, and reads Ri,1..Ri,ci at each locus i]

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]

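Model Training

• Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

• P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise

• $P(R_{i,j} = r \mid G_i = g_i) = \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{\,r(i)} \left(1 - \varepsilon_{r(i)}\right)^{1 - r(i)} + \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} \left(1 - \varepsilon_{r(i)}\right)^{r(i)}$

– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i) = 1} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\, r(i) = 0} \varepsilon_{r(i)}$

$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$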

• Joint probabilities can be computed using a forward-backward algorithm:

• Direct implementation gives $O(m + nK^4)$ time, where
– m = number of reads
– n = number of SNPs
– K = number of founder haplotypes in HMMs

• Runtime reduced to $O(m + nK^3)$ using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

Posterior Decoding Algorithm
1. For each i = 1..n, compute $g_i^* = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $\mathbf{g}^* = (g_1^*, \ldots, g_n^*)$

Experimental Setup

• Read data
– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
– Average read length ~265 bp

• Reference population data
– CEU genotypes from latest Hapmap release (23a) from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/
– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMMs using the Baum-Welch algorithm

• Genotype data
– Duplicate Affymetrix 500k SNP genotypes downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations differed by more than 5% in CEU same-strand allele frequency

Read Mapping Procedure

• Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
– Default nucmer parameters:
• MUM size 20, min cluster size 65, max gap between adjacent matches 90
– Additional filtering
• At least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting above conditions at multiple genome positions were discarded
• These reads likely come from genomic repeats and are difficult to map accurately

• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
– Estimated mapping FP rate is 0.37%

Read Mapping Results

[Figure: Hapmap SNP coverage by mapped reads; number of SNPs (0 to 600,000) at each coverage from 0 to 30]

• Average coverage by mapped reads of Hapmap SNPs was 5.64x
– Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

Genotyping Accuracy at 5.64x Read Coverage

[Figure, panel "Homozygous Genotypes": % correctly called (97 to 100) versus % uncalled (0 to 50)]

[Figure, panel "Heterozygous Genotypes": % correctly called (80 to 100) versus % uncalled (0 to 50)]


Conclusions & Ongoing Work

• Exploiting “heuristic inputs” such as quality scores, population allele frequency, and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
– Improvement depends on the coverage depth (higher at lower coverage)
• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads
– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]
– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads

• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.

Genotyping Accuracy at Lower Read Coverage

[Figure: % correctly called (70 to 100) versus % uncalled (0 to 100) for HMM-Posterior and Binomial0.01, each using 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC


>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15


>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA


Proposed Pipeline for Genotype Calling from Short Reads

Mapped reads

Hapmap genotypes90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000


Reference genome sequence





Read sequences

Quality scores

SNP genotype callsrs12095710 T T 9.988139e-01rs12127179 C T 9.986735e-01rs11800791 G G 9.977713e-01rs11578310 G G 9.980062e-01rs1287622 G G 8.644588e-01 rs11804808 C C 9.977779e-01rs17471528 A G 5.236099e-01rs11804835 C C 9.977759e-01rs11804836 C C 9.977925e-01rs1287623 G G 9.646510e-01 rs13374307 G G 9.989084e-01rs12122008 G G 5.121655e-01rs17431341 A C 5.290652e-01rs881635 G G 9.978737e-01 rs9700130 A A 9.989940e-01 rs11121600 A A 6.160199e-01rs12121542 A A 5.555713e-01rs11121605 T T 8.387705e-01rs12563779 G G 9.982776e-01rs11121607 C G 5.639239e-01rs11121608 G T 5.452936e-01rs12029742 G G 9.973527e-01rs562118 C C 9.738776e-01 rs12133533 A C 9.956655e-01rs11121648 G G 9.077355e-01rs9662691 C C 9.988648e-01 rs11805141 C C 9.928786e-01rs1287635 C C 6.113270e-01

ORIGINAL PROPOSAL PRESENTATION NEXT

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Justin Kennedy
Dissertation Proposal for the Degree of Doctorate in Philosophy

Computer Science & Engineering Department
University of Connecticut

Outline

Introduction



Conclusion

Introduction-Single Nucleotide Polymorphisms


High density in the human genome: $1.2 \times 10^7$ out of $3 \times 10^9$ base pairs

Vast majority bi-allelic 0/1 encoding (major/minor resp.)

SNP Genotypes are critical to Disease-Gene Mapping




131

Introduction- Why SNP Genotypes?

Single Nucleotide Polymorphisms (SNPs) have become the genetic marker of choice for genome wide association studies (GWASs)

GWAS: Method for mapping disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association
  Provides higher statistical power compared to other gene mapping methods such as linkage for uncovering genetic basis of complex diseases

Ongoing GWASs generate a deluge of genotype data
  Genetic Association Information Network (GAIN): 6 studies totaling 18,000 individuals typed at 500,000 to 940,000 SNP loci
  Wellcome Trust Case-Control Consortium (WTCCC): 7 studies totaling 17,000 individuals typed at 500,000 SNP loci

Major concern: quality of genotype data

Introduction-Computational Challenges to Disease Gene Mapping

Genotype error detection: Low levels of genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes

Handling structural variation data

provided by new sequencing technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to random nature of shotgun sequencing

133

Outline

Introduction
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
  Motivation
  Likelihood Sensitivity Approach to Error Detection
  Hidden Markov Model of Haplotype Diversity
  Efficiently Computable Likelihood functions
  Experimental Results
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion


A real problem despite advances in genotyping technology
  [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
  1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

Error types
  Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004]
  For pedigree data some errors detected as Mendelian Inconsistencies (MIs)
    E.g. only ~30% detectable as MIs for trios [Gordon et al. 1999]
  Undetected errors

Methods for handling undetected errors:
  Improved genotype calling algorithms [Marchini et al. 07, Xiao et al. 07]
  Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
  Separate error detection step
    Detected errors can be retyped, imputed, or ignored in downstream analyses

Genotype Error Detection-Haplotypes and Genotypes

Haplotype: description of SNP alleles on a chromosome
  0/1 vector: 0 for major allele, 1 for minor

Diploids: two homologous copies of each autosomal chromosome
  One inherited from mother and one from father

Genotype: description of alleles on both chromosomes
  0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles

two haplotypes: 011000110 + 001100010
genotype: 021200210
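The encoding on this slide (0 and 1 for the two homozygous cases, 2 for a heterozygous SNP) can be reproduced with a short Python sketch. The function name `genotype_from_haplotypes` and the string representation are assumptions made for the example.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Encoding used on this slide: equal alleles keep their value
    (0 = both major, 1 = both minor); different alleles give 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


genotype = genotype_from_haplotypes("011000110", "001100010")  # slide example
```

The mapping loses phase: swapping the alleles of the two haplotypes at any heterozygous SNP yields the same genotype, which is why haplotypes have to be inferred.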


Trio T
  Mother: 0 1 2 1 0 2 (haplotypes h1 = 0 1 1 1 0 0, h2 = 0 1 0 1 0 1)
  Father: 0 2 2 1 0 2 (haplotypes h3 = 0 0 0 1 0 1, h4 = 0 1 1 1 0 0)
  Child: 0 2 2 1 0 2 (haplotypes h1, h3)

Likelihood of best phasing for original trio T

$L(T) = \max\; p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$


Modified trio T’
  Mother: 0 1 2 1 0 2 (haplotypes h’1 = 0 1 0 1 0 1, h’2 = 0 1 1 1 0 0)
  Father: 0 2 2 1 0 2 (haplotypes h’3 = 0 0 0 1 0 0, h’4 = 0 1 1 1 0 1)
  Child: 0 2 ? 1 0 2 (haplotypes h’1, h’3)

Likelihood of best phasing for modified trio T’

$L(T') = \max\; p(h'_1)\, p(h'_2)\, p(h'_3)\, p(h'_4)$

[Figure: original trio T (Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2) next to the modified trio T’, in which the tested genotype is replaced by "?"]

Large change in likelihood suggests likely error
  Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., $R = 10^4$)
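The flagging rule above can be sketched in Python. This is a hedged illustration: it assumes that T’ is obtained by marking the tested genotype as missing (the "?" on the slides), and it takes the likelihood function as a caller-supplied argument. The names `flag_errors` and `toy_likelihood` are hypothetical, and the toy likelihood is a stand-in rather than the HMM-based likelihood of the slides.

```python
from typing import Callable, List, Optional, Tuple

Genotype = List[Optional[int]]              # None marks a deleted (missing) genotype
Trio = Tuple[Genotype, Genotype, Genotype]  # mother, father, child


def flag_errors(trio: Trio, likelihood: Callable[[Trio], float],
                R: float = 1e4) -> List[Tuple[int, int]]:
    """Return (member, snp) positions whose deletion raises the likelihood by more than R."""
    base = likelihood(trio)
    flagged = []
    for member in range(3):
        for snp in range(len(trio[member])):
            modified = tuple(list(g) for g in trio)
            modified[member][snp] = None  # T': tested genotype treated as missing
            new = likelihood(modified)
            if base > 0:
                ratio = new / base
            else:
                ratio = float("inf") if new > 0 else 0.0
            if ratio > R:
                flagged.append((member, snp))
    return flagged


def toy_likelihood(trio: Trio) -> float:
    """Hypothetical stand-in: heterozygous calls (2) are made very unlikely."""
    p = 1.0
    for genotypes in trio:
        for g in genotypes:
            if g is None:
                continue
            p *= 1e-5 if g == 2 else 0.5
    return p


flags = flag_errors(([0, 1], [0, 1], [0, 2]), toy_likelihood)
```

Recomputing the likelihood from scratch for every tested genotype is the naive approach; the later slides describe forward-backward precomputation that avoids this cost.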


139

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection

[Becker et al. 06] Implementation in FAMHAP Software

Window-based algorithm
  For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)
  Find most likely trio phasings by pruned search over the $H^4$ quadruples of frequent haplotypes
  Flag genotype as an error if L(T’)/L(T) > R for at least one window

Mother …201012 1 02210...
Father …201202 2 10211...
Child  …000120 2 21021...

Genotype Error Detection- Limitations of FAMHAP

Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values

False positives caused by nearby errors (due to the use of multiple short windows)

Our approach:
  HMM of haplotype frequencies: all haplotypes represented, no need for short windows
  Alternate likelihood functions: scalable runtime

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

Similar HMMs proposed by [Kimmel &Shamir 05, Rastas et al. 05, Schwartz 04]

Paths with high transition probability correspond to “founder” haplotypes

Haplotype sequence/paths computed using Viterbi and forward algorithms

[Figure: HMM with K = #Founders (e.g. K=4) states per SNP and N = #SNPs (e.g. N=5) levels]

Transition Prob: $\gamma_j(q', q)$

Emission Prob: $E(j; q)$
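The forward algorithm mentioned on this slide can be sketched for a single haplotype as follows. The parameter layout is an assumption made for the example: `init[q]` is the initial founder probability, `trans[j][p][q]` the transition probability from founder p at SNP j to founder q at SNP j+1, and `emit[j][q][a]` the probability that founder q emits allele a at SNP j. The numbers are toy values, not trained parameters.

```python
from typing import List


def haplotype_probability(hap: List[int], init: List[float],
                          trans: List[List[List[float]]],
                          emit: List[List[List[float]]]) -> float:
    """Total probability that the HMM emits `hap`, summed over all founder paths."""
    K = len(init)
    # Forward values at the first SNP.
    fwd = [init[q] * emit[0][q][hap[0]] for q in range(K)]
    for j in range(1, len(hap)):
        fwd = [
            emit[j][q][hap[j]] * sum(fwd[p] * trans[j - 1][p][q] for p in range(K))
            for q in range(K)
        ]
    return sum(fwd)


# Toy model with K = 2 founders and N = 2 SNPs (hypothetical numbers).
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
emit = [[[0.8, 0.2], [0.3, 0.7]], [[0.8, 0.2], [0.3, 0.7]]]
p00 = haplotype_probability([0, 0], init, trans, emit)
total = sum(haplotype_probability([a, b], init, trans, emit)
            for a in (0, 1) for b in (0, 1))
```

Each SNP costs O(K^2) work here, so one haplotype takes O(NK^2) time; the probabilities of all possible haplotypes sum to 1, which is a convenient sanity check.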


Training: 2-step algorithm that exploits pedigree info
  Step 1: Obtain haplotypes using either:
    ENT: A pedigree-aware haplotype phasing algorithm based on entropy-minimization
    Haplotype reference panel (e.g. HAPMAP)
  Step 2: Train HMM based on inferred haplotypes, using Baum-Welch

[Figure: HMM with N = #SNPs (e.g. N=5) levels; Transition Prob: $\gamma_j(q', q)$, Emission Prob: $E(j; q)$]

Genotype Error Detection- Alternate Likelihood Functions

• Viterbi probability (ViterbiProb): Maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio

• Probability of Viterbi Haplotypes (ViterbiHaps): Product of total probabilities of the 4 Viterbi haplotypes

• Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths

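$V(j; \mathbf{q}) = E(j; \mathbf{q}) \max_{\mathbf{q}' \in Q_{j-1}^4} \{ V(j-1; \mathbf{q}')\, \gamma_j(\mathbf{q}', \mathbf{q}) \}$

$T(j; \mathbf{q}) = E(j; \mathbf{q}) \sum_{\mathbf{q}' \in Q_{j-1}^4} \{ T(j-1; \mathbf{q}')\, \gamma_j(\mathbf{q}', \mathbf{q}) \}$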

Genotype Error Detection- Speed Up of Viterbi Probability

For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in $O(NK^8)$ time

$K^3$ speed-up by reuse of common terms (similar to [Rastas et al. 05]):

)},'()',,,;1({max),,,;(),,,;1( 4443213'43214321 14qqqqqqjSqqqqjEqqqqjV jQq j

)},'()',',,;1({max)',,,;( 3343212'43213 13qqqqqqjSqqqqjS jQq j

)},'()',',',;1({max)',',,;( 2243211'43212 12qqqqqqjSqqqqjS jQq j

)},'()',',',';1({max)',',',;( 114321'43211 11qqqqqqjVqqqqjS jQq j

145

Genotype Error Detection- Overall Function Runtimes

Viterbi probability Likelihoods of all 3N modified trios can be computed

within time using forward-backward algorithm Overall runtime for M trios

Probability of Viterbi haplotypes Obtain haplotypes from standard traceback, then

compute haplotype probabilities using forward algorithms

Overall runtime Total trio probability

Similar pre-computation speed-up & forward-backward algorithm

Overall runtime

))(( 25 KNNKMO

)( 5MNKO

)( 5MNKO)( 5NKO

146

Genotype Error Detection- Experimental Results (Setup)

Real dataset [Becker et al. 2006]
  35 SNP loci covering a region of 91kb
  551 trios

Synthetic datasets
  35 SNPs, 551 trios
  Preserved missing data pattern of real dataset
  Haplotypes assigned to trios based on frequencies inferred from real dataset
  1% error rate using random allele insertion model

Genotype Error Detection- Comparison of Likelihood Functions

[Figure: sensitivity (0 to 1) versus FP rate for VitHaps-P, VitProb-P, TotalProb-P, VitHaps-C, VitProb-C, and TotalProb-C]

Sensitivity = TP/(TP+FN)

False Positive rate = 1 - TN/(FP+TN)
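The two evaluation metrics can be computed from confusion-matrix counts as in the following Python sketch. The function names and the counts in the example are illustrative, not results from the experiments.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true errors that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of correct genotypes that were flagged: FP / (FP + TN) = 1 - TN / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts: 100 inserted errors, 1000 correct genotypes.
sens = sensitivity(tp=80, fn=20)
fpr = false_positive_rate(fp=5, tn=995)
```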

Distribution of Log-Likelihood Ratios for TotalTrioProb

[Figure: histograms of log-likelihood ratios (bins from 0 to 5.94) with counts on a log scale from 1 to 1,000,000, series NO_ERR and ERR; panels "Parents-TRIOS" and "Children-TRIOS"; annotation: same-locus errors in parents]

Genotype Error Detection-“Combined” Detection Method

Compute 4 likelihood ratios:
  Trio
  Mother-child duo
  Father-child duo
  Child (unrelated)

Flag as error if all ratios are above detection threshold

Distribution of Log-Likelihood Ratios for Combined Method

[Figure: histograms of log-likelihood ratios with counts on a log scale from 1 to 1,000,000, series NO_ERR and ERR; panels "Parents-COMBINED" and "Children-COMBINED"]

Comparison with FAMHAP (Children)

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for TotalProb-TRIO, TotalProb-COMBINED, and FAMHAP, with the #FP=#FN line]

152

Comparison with FAMHAP (Parents)

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for TotalProb-TRIO, TotalProb-COMBINED, and FAMHAP, with the #FP=#FN line]

153

Outline

Introduction

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Single Individual Genotyping from Low-Coverage Sequencing Data
  Motivation
  Single SNP Calling Algorithm
  Multilocus HMM Calling Algorithm
  Experimental Results

Conclusion

154

Low Coverage Genotyping-Next Generation Sequencing (NGS)

Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35bp reads

Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads

Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads

NGS delivers sequencing read throughput several orders of magnitude higher than older technologies (e.g. Sanger sequencing)

More improvements expected in quest for $1,000 genome

155

Low Coverage Genotyping-NGS Applications and Challenges

NGS is enabling many applications, including personal genomics
  ~$1 million for sequencing the James Watson genome [Wheeler et al 08] using 454 technology
  ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
  Thousands more individual genomes to be sequenced as part of the 1000 Genomes Project

Challenges:
  Sequencing requires accurate determination of genetic variation (e.g. SNPs)
  [Wheeler et al 08] use hypothesis testing based on the binomial distribution
  [Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

156


Low Coverage Genotyping-Do Heuristic Inputs Help?

We propose methods incorporating two additional sources of information:

Quality scores reflecting uncertainty in sequencing data

Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap

Experiments on a subset of the James Watson 454 reads show that our methods yield improved genotyping accuracy

Improvement depends on the coverage depth (higher at lower coverage): e.g., the accuracy achieved by the binomial test of [Wheeler et al. 08] at 5.6-fold mapped read coverage is achieved by our methods using less than 1/4 of the reads

157

Low Coverage Genotyping-Pipeline for Single Genotype Calling

158

Single SNP Genotyping- Basic Notations

Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)

SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous

[Figure: mapped reads carrying allele 0 or allele 1 aligned over the SNP loci, with sequencing errors marked, and the inferred genotypes 012100120]

159

Single SNP Genotyping- Incorporating Base Call Uncertainty

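Let r_i denote the set of mapped reads covering SNP locus i and c_i = |r_i|

For a read r in r_i, r(i) denotes the allele observed at locus i

If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by

$$\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$$

The probability of observing read set r_i conditional on having genotype G_i is then given by:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$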
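The read-set likelihoods above can be sketched in a few lines. This is an illustrative implementation under the stated model; the representation of a read as an (allele, phred quality) pair and the example reads are assumptions made for the sketch.

```python
def phred_to_error(q):
    """Error probability of a base call with phred quality score q."""
    return 10.0 ** (-q / 10.0)


def read_set_likelihoods(reads):
    """Return [P(r_i | G_i=0), P(r_i | G_i=1), P(r_i | G_i=2)].

    reads: list of (allele, phred_quality) pairs covering one SNP locus,
    with allele 0 = major and 1 = minor.
    """
    lik0 = lik2 = 1.0
    for allele, q in reads:
        eps = phred_to_error(q)
        lik0 *= (1.0 - eps) if allele == 0 else eps  # homozygous major
        lik2 *= (1.0 - eps) if allele == 1 else eps  # homozygous minor
    lik1 = 0.5 ** len(reads)                         # heterozygous: (1/2)^c_i
    return [lik0, lik1, lik2]


# Hypothetical locus covered by three reads: two show allele 0, one shows allele 1.
liks = read_set_likelihoods([(0, 20), (0, 30), (1, 10)])
```

For long read sets a log-space version would avoid underflow; the product form is kept here to mirror the formulas.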

160

Single SNP Genotype Calling

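Applying Bayes’ formula:

$$P(G_i = g_i \mid \mathbf{r}_i) = \frac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$$

Where P(G_i = g_i) are allele frequencies inferred from a representative panel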
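A minimal sketch of the single-SNP call via Bayes' formula follows. Turning a panel minor-allele frequency into genotype priors through Hardy-Weinberg proportions is an assumption of this sketch (the slide only says the priors come from panel allele frequencies); the function names and numbers are illustrative.

```python
def hardy_weinberg_priors(minor_allele_freq):
    """Genotype priors [P(G=0), P(G=1), P(G=2)] from a minor allele frequency.

    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch).
    """
    p = minor_allele_freq
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def genotype_posteriors(priors, likelihoods):
    """Posterior P(G_i = g | r_i) for g = 0, 1, 2 by Bayes' formula."""
    joint = [pr * lik for pr, lik in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical read-set likelihoods P(r_i | G_i = g) for g = 0, 1, 2.
likelihoods = [0.098901, 0.125, 0.000009]
posteriors = genotype_posteriors(hardy_weinberg_priors(0.2), likelihoods)
call = posteriors.index(max(posteriors))  # genotype with maximum posterior
```

A caller could leave the genotype uncalled when the maximum posterior falls below a chosen confidence cutoff.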

161

Low Coverage Genotyping-Pipeline for Multilocus Genotyping

162

Multilocus Genotyping-HF-HMM

[Figure: factorial HMM with two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n; haplotype alleles Hi and H'i; genotypes Gi; and read sets Ri,1, …, Ri,ci observed at each locus i]

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

163

Multilocus Genotyping-HF-HMM Training

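Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father

Use haplotype reference panel (e.g. HAPMAP) for training haplotypes

Conditional probabilities for read sets are given by the formulas derived for the single SNP case:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$

164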

Multilocus Genotyping Problem

GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Quality scores
• Trained HMMs representing LD in populations of origin for mother/father

FIND:
• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r)

Remark: maxg P(g | r) is hard to approximate within $O(n^{1-\varepsilon})$ unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard

165

Multilocus Genotyping-HMM-Posterior Decoding Algorithm

1. For each i = 1..n, compute

$$g_i^* = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$$

2. Return $g^* = (g_1^*, \ldots, g_n^*)$

Joint probabilities are computed with a forward-backward algorithm:

$$P(g_i, \mathbf{r}) = P(\mathbf{r}_i \mid G_i = g_i) \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} F^i_{f_i,f'_i}\, E^i_{f_i,f'_i}(g_i)\, B^i_{f_i,f'_i}$$

[Figure: the HF-HMM around locus i, with founder states f_i and f'_i, alleles h_i and h'_i, genotype g_i, and read sets r_{1,1}, …, r_{n,c}; the animation highlights the forward, emission, and backward parts of the sum in turn]
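The posterior decoding steps above can be sketched on a toy model as follows. This is an illustrative implementation, not the Gene-Seq code: it assumes both parental chains share one HMM, and the parameter layout (`init`, `trans`, `emit`, `read_lik`) and the numbers in the example are hypothetical.

```python
from itertools import product


def pair_emission(e, f, fp):
    """E(g) for g = 0, 1, 2: sum over h + h' = g of P(h|f) P(h'|f')."""
    a, b = e[f], e[fp]  # probabilities of the minor allele under each founder
    return [(1 - a) * (1 - b), a * (1 - b) + (1 - a) * b, a * b]


def posterior_decode(init, trans, emit, read_lik):
    """Return (calls, joints) with joints[i][g] = P(g_i = g, r).

    init[f] = P(f_1 = f); trans[i][a][f] = P(f_{i+1} = f | f_i = a) (0-based i);
    emit[i][f] = P(h_i = 1 | f); read_lik[i][g] = P(r_i | G_i = g).
    """
    n, K = len(emit), len(init)
    pairs = list(product(range(K), repeat=2))
    E = [{p: pair_emission(emit[i], *p) for p in pairs} for i in range(n)]
    # W marginalizes the hidden genotype: sum_g P(r_i | g) E(g)
    W = [{p: sum(read_lik[i][g] * E[i][p][g] for g in range(3)) for p in pairs}
         for i in range(n)]
    F = [{p: init[p[0]] * init[p[1]] for p in pairs}]  # forward, excludes locus i
    for i in range(1, n):
        t = trans[i - 1]
        F.append({(f, fp): sum(F[i - 1][a, b] * W[i - 1][a, b] * t[a][f] * t[b][fp]
                               for a, b in pairs)
                  for f, fp in pairs})
    B = [None] * n                                     # backward, excludes locus i
    B[n - 1] = {p: 1.0 for p in pairs}
    for i in range(n - 2, -1, -1):
        t = trans[i]
        B[i] = {(f, fp): sum(t[f][a] * t[fp][b] * W[i + 1][a, b] * B[i + 1][a, b]
                             for a, b in pairs)
                for f, fp in pairs}
    joints = [[read_lik[i][g] * sum(F[i][p] * E[i][p][g] * B[i][p] for p in pairs)
               for g in range(3)] for i in range(n)]
    calls = [j.index(max(j)) for j in joints]
    return calls, joints


# Toy example: 2 founders, 3 loci; the middle locus has uninformative reads.
stay = [[0.9, 0.1], [0.1, 0.9]]
calls, joints = posterior_decode(
    init=[0.5, 0.5], trans=[stay, stay], emit=[[0.05, 0.95]] * 3,
    read_lik=[[0.9, 0.1, 0.001], [0.25, 0.25, 0.25], [0.9, 0.1, 0.001]])
```

In the toy example the middle locus is called from linkage disequilibrium with its neighbours alone. The loops over founder pairs are written directly, so each locus costs O(K^4) here.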

Multilocus Genotyping-Runtime

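Direct implementation gives O(m+nK^4) time:
  m = number of reads
  n = number of SNPs
  K = number of founder haplotypes in HMMs

Runtime reduced to O(m+nK^3) by reusing common terms:

$$F^i_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} P(f_i \mid f_{i-1})\, C^i_{f_{i-1},f'_i}$$

where

$$C^i_{f_{i-1},f'_i} = \sum_{f'_{i-1}=1}^{K} F^{i-1}_{f_{i-1},f'_{i-1}}\, E^{i-1}_{f_{i-1},f'_{i-1}}\, P(f'_i \mid f'_{i-1})$$

$$E^i_{f_i,f'_i} = \sum_{h_i,h'_i \in \{0,1\}} P(h_i \mid f_i)\, P(h'_i \mid f'_i)\, P(\mathbf{r}_i \mid G_i = h_i + h'_i)$$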
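The speed-up can be illustrated on a single forward step: summing over both previous founders at once costs O(K^4) per locus, while summing over one chain first and reusing the partial sums costs O(K^3). The sketch below assumes both chains share one transition matrix; the names `G`, `T`, and `C` are illustrative.

```python
def forward_step_direct(G, T):
    """One forward step in O(K^4).

    G[a][b] = forward value times emission term at the previous locus
    T[a][f] = P(f_i = f | f_{i-1} = a), shared by both chains in this sketch
    """
    K = len(G)
    return [[sum(G[a][b] * T[a][f] * T[b][fp]
                 for a in range(K) for b in range(K))
             for fp in range(K)] for f in range(K)]


def forward_step_fast(G, T):
    """The same step in O(K^3) by reusing common terms."""
    K = len(G)
    # C[a][fp] = sum over the second chain's previous founder b
    C = [[sum(G[a][b] * T[b][fp] for b in range(K)) for fp in range(K)]
         for a in range(K)]
    # Then sum over the first chain's previous founder a
    return [[sum(T[a][f] * C[a][fp] for a in range(K)) for fp in range(K)]
            for f in range(K)]


# Hypothetical values for K = 2.
G = [[0.1, 0.2], [0.3, 0.4]]
T = [[0.9, 0.1], [0.2, 0.8]]
slow = forward_step_direct(G, T)
fast = forward_step_fast(G, T)
```

Both functions return the same table; only the order of summation differs.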

Low Coverage Genotyping-Experimental Results- Setup

Subset of James Watson’s 454 reads
  74.4M of 106.5M reads
  265 bp/read
  Avg coverage: 5.64X
  Quality scores included

Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
  Estimated mapping error rates: FP rate: 0.37%, FN rate: 21.16%

Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)

173

Accuracy Comparison (Heterozygous Genotypes)

[Figure: % correctly called (80 to 100) versus % uncalled (0 to 100)]


174

Accuracy Comparison (All Genotypes)

[Figure: % correctly called (93 to 100) versus % uncalled (0 to 100)]


175

Accuracy at Varying Coverages (All Genotypes)

[Figure: % correctly called (70 to 100) versus % uncalled (0 to 100) for Posterior and Binomial calling, each using 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]

176

Outline

Introduction



Conclusion

177

Conclusion Genotype Error Detection

Contributions:
  Proposed efficient methods for error detection in trio genotype data based on an HMM of haplotype diversity
    Can exploit available pedigree info
    Yield improved detection accuracy compared to FAMHAP
    Runtime grows linearly in #SNPs and #individuals

Papers/Software:
  J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Journal of Computational Biology (to appear).
  J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007.
  J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. In 3rd RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, 2007.
  Software: GEDI (Genotype Error Detection and Imputation)
  Best poster award: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.

178

Conclusion Genotyping from low coverage sequencing reads

Contributions:
  Exploiting “heuristic inputs” such as quality scores and population allele frequency and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
    LD information extracted from a reference panel gives the highest benefit
    Relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]
  Although our evaluation is on 454 reads, the methods are well-suited for short read technologies

Papers/Software:
  S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single individual genotyping from low-coverage sequencing data (in preparation)
  Presentation: J. Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.
  Software: Gene-Seq

179

Conclusion Future Work: Population-based Genotyping from Low-Coverage Sequencing Data

Extend the single individual genotyping methods to population sequencing data (removing the need for reference panels)

Use the same HF-HMM as before, but train the model with an EM algorithm on the population-level data provided

180

Questions?

181

Acknowledgments

This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation

182

Click again for helper slides

183

HAPMAP: The International HAPMAP Project is an organization

whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.

HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.

The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.

184



Calculating the forward probability and the backward probability for each HMM state;


185

Error Detection Accuracy on Unrelated Genotype Data

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for Len=10Kb (U), Len=100Kb (U), Len=1Mb (U), and Len=10Mb (U), with the #FP=#FN line]

551 unrelated individuals
Recombination & mutation rates of 10^-8 per generation per bp
35 SNPs within a region of 10kb-10Mb

186

TrioProb-Combined Results on Real Dataset

[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
  26 SNP genotypes in 23 trios were identified as true errors
  41*3-26=97 resequenced SNP genotypes agree with original calls (or are unknown)

            Total Signals      True Positives     False Positives    Unknown
FP Rate     1%    .5%   .1%    1%    .5%   .1%    1%    .5%   .1%    1%    .5%   .1%
Parents     218   127   69     9     9     8      1     0     0      208   118   61
Children    104   74    24     11    11    11     3     3     2      90    60    11
Total       322   201   93     20    20    19     4     3     2      298   178   72

187

Error Model Comparison

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for the random allele, random geno, hetero-to-homo, and homo-to-hetero error models, each for parents (P) and children (C)]

188

Effect of Population Size

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for n=551, n=129, and n=30, each for parents (P) and children (C)]

189

Binomial Distribution

n: # of successive trials (reads)
p: the probability of a success (correct map)
q = 1-p: the probability of a failure (incorrect map)

To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
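A minimal sketch of the heterozygous calling rule stated above. It assumes the "binomial probability" is the point probability of the observed allele counts under p = 0.5; if a tail probability was intended, `binomial_pmf` would need to be replaced by a sum over the tail. Function names are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def call_heterozygous(n0, n1, min_prob=0.01):
    """Apply the binomial rule to n0 reads with allele 0 and n1 with allele 1."""
    if n0 < 1 or n1 < 1:
        return False  # each allele must be covered by at least one read
    return binomial_pmf(n1, n0 + n1) >= min_prob


balanced = call_heterozygous(3, 3)  # C(6,3)/2^6 = 0.3125
skewed = call_heterozygous(9, 1)    # C(10,1)/2^10 is just below 0.01
missing = call_heterozygous(5, 0)   # allele 1 not covered
```

The skewed example shows why the rule needs substantial coverage: one minor-allele read among ten is already rejected.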

190

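Conditional Probability for Heterozygous Genotypes

$$P(\mathbf{r}_i \mid G_i = 1) = \prod_{r \in \mathbf{r}_i} \left[ \tfrac{1}{2}\,(1 - \varepsilon_{r(i)}) + \tfrac{1}{2}\,\varepsilon_{r(i)} \right] = \left(\tfrac{1}{2}\right)^{c_i}$$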

191

Model Training- Details

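Initial founder probabilities P(f_1), P(f'_1), transition probabilities P(f_{i+1}|f_i), P(f'_{i+1}|f'_i), and emission probabilities P(h_i|f_i), P(h'_i|f'_i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(g_i|h_i,h'_i) set to 1 if h_i+h'_i=g_i and to 0 otherwise

$$P(R_{i,j} = r \mid G_i = g_i) = \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{\,r(i)} (1 - \varepsilon_{r(i)})^{1 - r(i)} + \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} (1 - \varepsilon_{r(i)})^{r(i)}$$

This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$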

192

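Implementation Details

$$E^i_{f_i,f'_i}(g_i) = \sum_{\substack{h_i,h'_i \in \{0,1\} \\ h_i + h'_i = g_i}} P(h_i \mid f_i)\, P(h'_i \mid f'_i)$$

Forward recurrences:

$$F^1_{f_1,f'_1} = P(f_1)\, P(f'_1)$$

$$F^i_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} \sum_{f'_{i-1}=1}^{K} F^{i-1}_{f_{i-1},f'_{i-1}}\, P(f_i \mid f_{i-1})\, P(f'_i \mid f'_{i-1})\, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1})$$

Backward recurrences are similar

193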


Subset of James Watson’s 454 reads
  74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
  Average read length ~265 bp

[Figure: read length distribution; counts from 0 to 2,000,000 over read lengths 0 to 1200 bp]

194




Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]

Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)

Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded

Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:

FP rate: 0.37% FN rate: 21.16%

195


Average coverage by mapped reads of Hapmap SNPs was 5.64x

Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

[Figure: SNP coverage by mapped reads; histogram with counts from 0 to 600,000 over coverage values 0 to 30]

196


CEU genotypes from latest Hapmap release (23a) were downloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch

Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations differed by more than 5% in CEU same-strand allele frequency

197





Accuracy Comparison (Homozygous Genotypes)

[Figure: % correctly called (97 to 100) versus % uncalled (0 to 100)]


198

Gene-Seq Algorithms
  Posterior Decoding (see presentation)
  Greedy
  Markov Approximation
  Composite 2-SNP Viterbi
  Posterior Decoding, version 2

199

Introduction Helper

Linkage Analysis: study aimed at establishing linkage between genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.

Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.

Relative Risk is the risk of an event (or of developing a disease) relative to exposure.

Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.

For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.

Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits

201




Introduction-Disease Gene Mapping

Linkage analysis
  Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)
  Low power to detect genes with small relative risk in complex diseases [Risch&Merikangas ’96]

Association analysis
  Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies

[Figure: cases and controls]

202


Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)

Errors as low as .1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]

1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

203

Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao et al. 07]

Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
  Computationally complex

Separate error detection step
  Detected errors can be retyped, imputed, or ignored in downstream analyses
  Common approach in pedigree genotype data analysis [Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]


204

Complexity of Computing Maximum Phasing Probability

• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ unless ZPP=NP, where f is the number of founders
• For trios, hard to approximate within $O(f^{1/4-\varepsilon})$
• Reductions from the clique problem

205

NGS Applications

Besides reducing costs of de novo genome sequencing, NGS has found many more applications:
  Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …

NGS is enabling personal genomics
  James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million, compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
  Thousands more individual genomes to be sequenced as part of the 1000 Genomes Project

206


Challenges in Medical Applications of Sequencing

Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)

Requires accurate determination of both alleles at variable loci




To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

207

Prior Methods for Calling SNP Genotypes from Read Data

Prior methods are all based on allele coverage
  [Levy et al 07] require that each allele be covered by at least 2 reads in order to be called
  [Wheeler et al 08] use hypothesis testing based on the binomial distribution
    To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
  [Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k

208

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads

S. Dinakar (1), Y. Hernandez (2), J. Kennedy (1), I. Mandoiu (1), and Y. Wu (1)

(1) CSE Department, University of Connecticut
(2) Department of Computer Science, Hunter College

Motivation
• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)
– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing
– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples
– Based on idealized theory that “neglects heuristic inputs”
• We propose genotype calling methods that exploit the following heuristic inputs
– Quality scores reflecting uncertainty in sequencing data
– Allele frequency and linkage disequilibrium info extracted from reference panels such as Hapmap
• Do heuristic inputs help?
– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads

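Single SNP Genotype Calling
• Let r_i denote the arbitrarily ordered set of mapped reads covering SNP locus i and c_i = |r_i|
– For a read r in r_i, r(i) denotes the allele observed at locus i
– If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by

$$\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$$

• The probability of observing read set r_i conditional on having genotype G_i:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$

• Applying Bayes’ formula:

$$P(G_i = g_i \mid \mathbf{r}_i) = \frac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$$

– Where P(G_i = g_i) are allele frequencies inferred from a representative panel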

Multilocus Model

[Figure: factorial HMM with two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n; haplotype alleles Hi and H'i; genotypes Gi; and read sets Ri,1, …, Ri,ci observed at each locus i]

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]

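Model Training

• Initial founder probabilities P(f_1), P(f'_1), transition probabilities P(f_{i+1}|f_i), P(f'_{i+1}|f'_i), and emission probabilities P(h_i|f_i), P(h'_i|f'_i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

• P(g_i|h_i,h'_i) set to 1 if h_i+h'_i=g_i and to 0 otherwise

•
$$P(R_{i,j} = r \mid G_i = g_i) = \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{\,r(i)} (1 - \varepsilon_{r(i)})^{1 - r(i)} + \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} (1 - \varepsilon_{r(i)})^{r(i)}$$

– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$$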

Posterior Decoding Algorithm

1. For each i = 1..n, compute $g_i^* = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $g^* = (g_1^*, \ldots, g_n^*)$

• Joint probabilities can be computed using a forward-backward algorithm:

$$P(g_i, \mathbf{r}) = P(\mathbf{r}_i \mid G_i = g_i) \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} F^i_{f_i,f'_i}\, E^i_{f_i,f'_i}(g_i)\, B^i_{f_i,f'_i}$$

• Direct implementation gives O(m+nK^4) time, where
– m = number of reads
– n = number of SNPs
– K = number of founder haplotypes in HMMs

• Runtime reduced to O(m+nK^3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]


• Read data– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et

al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/

– Average read length ~265 bp

• Reference population data– CEU genotypes from latest Hapmap release (23a) from


– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMM models using the Baum-Welch algorithm

• Genotype data– Duplicate Affymetrix 500k SNP genotypes downloaded from


– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations differed by more than 5% in CEU same-strand allele frequency

Experimental Setup

Read Mapping Procedure

• Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
– Default nucmer parameters: MUM size 20, min cluster size 65, max gap between adjacent matches 90
– Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting above conditions at multiple genome positions were discarded; these reads likely come from genomic repeats and are difficult to map accurately

• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
– Estimated mapping FP rate is 0.37%

Read Mapping Results

[Figure: Hapmap SNP coverage by mapped reads; histogram with counts from 0 to 600,000 over coverage values 0 to 30]

• Average coverage by mapped reads of Hapmap SNPs was 5.64x
– Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

Genotyping Accuracy at 5.64x Read Coverage

[Figure: Homozygous Genotypes; % correctly called (97 to 100) versus % uncalled (0 to 50)]

[Figure: Heterozygous Genotypes; % correctly called (80 to 100) versus % uncalled (0 to 50)]


Conclusions & Ongoing Work

• Exploiting “heuristic inputs” such as quality scores, population allele frequency, and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
– Improvement depends on the coverage depth (higher at lower coverage)
• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads
– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]
– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads
• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.

Genotyping Accuracy at Lower Read Coverage

[Figure: % correctly called (70 to 100) versus % uncalled (0 to 100) for HMM-Posterior and Binomial0.01 calling, each using 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]







Proposed Pipeline for Genotype Calling from Short Reads

[Figure: pipeline diagram. Inputs: read sequences with quality scores, the reference genome sequence, and Hapmap genotypes (pedigree records with 0/1/2/? genotype strings). Intermediate: mapped reads. Output: SNP genotype calls, each with a numeric score, e.g. rs12095710 T T 9.988139e-01, rs12127179 C T 9.986735e-01, rs17471528 A G 5.236099e-01]

GEDI-ADMX presentation from earlier this year (Ft lauderdale)

Imputation-based local ancestry inference in admixed populations

Justin Kennedy

Computer Science and Engineering Department
University of Connecticut

Joint work with I. Mandoiu and B. Pasaniuc

211

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

212

Introduction- Motivation: Admixture mapping

Patterson et al, AJHG 74:979-1000, 2004 213

Introduction- Local ancestry inference problem

Given:
  Reference haplotypes for ancestral populations P1,…,PN
  Whole-genome SNP genotype data for extant individual

Find:
  Allele ancestries at each SNP locus

[Figure: reference haplotype panels (0/1/? strings) and SNP genotypes (e.g. rs11095710 T T, rs11117179 C T, rs17471518 A G, ...) as input; inferred local ancestry (e.g. rs11095710 P1 P1, rs11578310 P1 P2, ...) as output]

214

Introduction- Previous work

MANY methods
  Ancestry inference at different granularities, assuming different kinds/amounts of info about genetic makeup of ancestral populations

Two main classes of methods
  HMM-based (exploit LD): SABER [Tang et al 06], SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …
  Window-based (unlinked SNP data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]

Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)

Methods based on unlinked SNPs outperform methods that model LD!

215





Conclusion

216

Haplotype structure in panmictic populations

217

HMM of haplotype frequencies

Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]

K = 4 (# founders)

n = 5 (# SNPs)

218

Graphical model representation

Random variables for each locus i (i=1..n)

F_i = founder haplotype at locus i; values between 1 and K

H_i = observed allele at locus i; values: 0 (major) or 1 (minor)

Model training

Based on reference haplotypes using the Baum-Welch algorithm, or

Based on unphased genotypes using EM [Rastas et al. 05]

Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n=#SNPs, K=#founders
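The O(nK^2) forward computation can be sketched as follows. This is a minimal illustration, not the GEDI implementation: the array layout (`init`, `trans`, `emit`), the function name and the random toy parameters are assumptions made for the example.

```python
import itertools
import numpy as np

def haplotype_likelihood(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm.

    Assumed (hypothetical) layout:
      init[k]        = P(F_1 = k)
      trans[i][a, b] = P(F_{i+2} = b | F_{i+1} = a), for i = 0..n-2
      emit[i][k, x]  = P(H_{i+1} = x | F_{i+1} = k), x in {0, 1}
    """
    alpha = init * emit[0][:, h[0]]
    for i in range(1, len(h)):
        # One K x K vector-matrix product per locus: O(K^2) work, O(nK^2) overall.
        alpha = (alpha @ trans[i - 1]) * emit[i][:, h[i]]
    return float(alpha.sum())

# Toy model with random parameters (illustration only).
rng = np.random.default_rng(0)
n, K = 4, 3
init = rng.dirichlet(np.ones(K))
trans = rng.dirichlet(np.ones(K), size=(n - 1, K))   # each row sums to 1
emit = rng.dirichlet(np.ones(2), size=(n, K))        # each allele pair sums to 1

# Sanity check: the probabilities of all 2^n haplotypes sum to 1.
total = sum(haplotype_likelihood(h, init, trans, emit)
            for h in itertools.product([0, 1], repeat=n))
```

Each locus costs one K-by-K product, which is where the K^2 factor in the stated bound comes from.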

F1 F2 Fn…

H1 H2 Hn

219

F1 F2 … Fn

H1 H2 … Hn

F'1 F'2 … F'n

H'1 H'2 … H'n

G1 G2 … Gn

M_kl

Random variable for each locus i (i=1..n)

G_i = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)

220





Conclusion

221

HMM Based Genotype Imputation

Probability of observing genotype x at locus i given the known multilocus genotype g with missing data at i:

$P(g_i = x \mid g_{-i}, M) \propto P(g[g_i \leftarrow x] \mid M)$

g_i is imputed as $\arg\max_{x \in \{0,1,2\}} P(g[g_i \leftarrow x] \mid M)$
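The imputation rule can be sketched as below. This is a hedged illustration: `impute_genotype` and `toy_likelihood` are hypothetical names, and the toy likelihood treats loci as independent only to keep the example checkable by hand; in the setting of these slides the likelihood would come from the HMM.

```python
def impute_genotype(g, i, likelihood):
    """Return (best genotype, posterior over {0,1,2}) for locus i.

    `likelihood` is any callable giving P(g | M) for a complete
    multilocus genotype (an assumption of this sketch).
    """
    scores = []
    for x in (0, 1, 2):
        candidate = list(g)
        candidate[i] = x              # g[g_i <- x]
        scores.append(likelihood(candidate))
    total = sum(scores)
    posterior = [s / total for s in scores]   # P(g_i = x | g_{-i}, M)
    best = max((0, 1, 2), key=lambda x: posterior[x])
    return best, posterior

# Toy stand-in for P(g | M): independent per-locus genotype frequencies.
freqs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

def toy_likelihood(g):
    p = 1.0
    for locus, value in enumerate(g):
        p *= freqs[locus][value]
    return p

best, posterior = impute_genotype([0, None], 1, toy_likelihood)
```

Normalizing the three scores turns the proportionality into an actual posterior, which is what a confidence threshold on imputed genotypes would be applied to.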

222

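[Figure: slice of the factorial HMM at locus i, with founder states f_i and f'_i, alleles h_i and h'_i, and genotype g_i]

$P(g \mid M) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \varphi^{i}_{f_i,f'_i}\, E^{i}_{f_i,f'_i}(g_i)\, \beta^{i}_{f_i,f'_i}$

where $\varphi$ and $\beta$ denote the forward and backward probabilities and $E$ the genotype emission probability.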

223


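Runtime

Direct recurrences for computing forward probabilities take O(nK^4):

$\varphi^{1}_{f_1,f'_1} = P(f_1)\, P(f'_1)$

$\varphi^{i}_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} \sum_{f'_{i-1}=1}^{K} \varphi^{i-1}_{f_{i-1},f'_{i-1}}\, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1})\, P(f_i \mid f_{i-1})\, P(f'_i \mid f'_{i-1})$

Runtime reduced to O(nK^3) by reusing common terms:

$\varphi^{i}_{f_i,f'_i} = \sum_{f'_{i-1}=1}^{K} \psi^{i}_{f_i,f'_{i-1}}\, P(f'_i \mid f'_{i-1})$

where $\psi^{i}_{f_i,f'_{i-1}} = \sum_{f_{i-1}=1}^{K} \varphi^{i-1}_{f_{i-1},f'_{i-1}}\, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1})\, P(f_i \mid f_{i-1})$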
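The two recurrences can be compared numerically with a short sketch. The model layout (a tuple of initial, transition and emission arrays per chain), the function names and the random toy parameters are assumptions of this example, not the author's code; both chains are assumed to have the same number of founders K.

```python
import itertools
import numpy as np

def genotype_emission(emit_a, emit_b, g):
    """E_{f,f'}(g): total probability of allele pairs (h, h') with h + h' = g."""
    out = np.zeros((emit_a.shape[0], emit_b.shape[0]))
    for h in (0, 1):
        h2 = g - h
        if h2 in (0, 1):
            out += np.outer(emit_a[:, h], emit_b[:, h2])
    return out

def forward_direct(g, model_a, model_b):
    """Direct double sum over both previous founders: O(nK^4)."""
    (init_a, trans_a, emit_a), (init_b, trans_b, emit_b) = model_a, model_b
    K = len(init_a)
    phi = np.outer(init_a, init_b)
    for i in range(1, len(g)):
        w = phi * genotype_emission(emit_a[i - 1], emit_b[i - 1], g[i - 1])
        new = np.zeros((K, K))
        for f, f2, p, p2 in itertools.product(range(K), repeat=4):
            new[f, f2] += w[p, p2] * trans_a[i - 1][p, f] * trans_b[i - 1][p2, f2]
        phi = new
    return phi

def forward_reuse(g, model_a, model_b):
    """Same values via the intermediate term psi: two K^3 products per locus."""
    (init_a, trans_a, emit_a), (init_b, trans_b, emit_b) = model_a, model_b
    phi = np.outer(init_a, init_b)
    for i in range(1, len(g)):
        w = phi * genotype_emission(emit_a[i - 1], emit_b[i - 1], g[i - 1])
        psi = trans_a[i - 1].T @ w      # sum over the previous founder of chain a
        phi = psi @ trans_b[i - 1]      # sum over the previous founder of chain b
    return phi

def genotype_likelihood(g, model_a, model_b):
    """P(g | M): forward values at the last locus times the last emission."""
    phi = forward_reuse(g, model_a, model_b)
    last = len(g) - 1
    return float((phi * genotype_emission(model_a[2][last], model_b[2][last], g[last])).sum())

def random_model(rng, n, K):
    """Random toy parameters (illustration only)."""
    return (rng.dirichlet(np.ones(K)),
            rng.dirichlet(np.ones(K), size=(n - 1, K)),
            rng.dirichlet(np.ones(2), size=(n, K)))

rng = np.random.default_rng(1)
n, K = 4, 3
model_a, model_b = random_model(rng, n, K), random_model(rng, n, K)
g = [1, 0, 2, 1]
same = np.allclose(forward_direct(g, model_a, model_b), forward_reuse(g, model_a, model_b))
total = sum(genotype_likelihood(list(x), model_a, model_b)
            for x in itertools.product([0, 1, 2], repeat=n))
```

The saving comes from summing over one chain's previous founder before the other's, so each locus needs two K-by-K matrix products instead of a loop over K^4 index combinations.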

227


View local ancestry inference as a model selection problem

Each possible local ancestry defines a factorial HMM M_kl (for two ancestral populations: M_11, M_12, M_22)

Compute P(g_i = x | g_{-i}, M_kl) for all possible k, l, i, x values

Pick the model that re-imputes SNPs most accurately around the locus i

Fixed-window version: pick the ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus

Multi-window version: weighted voting over window sizes between 200 and 3000, with window weights proportional to average posterior probabilities
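A minimal sketch of the fixed-window version follows. It assumes the per-locus posteriors under each model have already been computed; the function name, the dictionary layout and the toy numbers are hypothetical.

```python
import numpy as np

def fixed_window_ancestry(post, w):
    """Pick, for each locus, the ancestry with the best windowed average.

    post: dict mapping an ancestry label (e.g. "11", "12", "22") to an array
          whose j-th entry is the posterior probability of the observed
          genotype at locus j under that model (assumed precomputed).
    w:    window half-size; the window is clipped at both ends of the region.
    """
    labels = list(post)
    scores = np.stack([np.asarray(post[label], dtype=float) for label in labels])
    n = scores.shape[1]
    calls = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        averages = scores[:, lo:hi].mean(axis=1)
        calls.append(labels[int(np.argmax(averages))])
    return calls

# Toy posteriors with an ancestry switch between loci 4 and 5 (illustration only).
post = {
    "11": [0.9, 0.9, 0.9, 0.8, 0.3, 0.2, 0.2, 0.1],
    "12": [0.5] * 8,
    "22": [0.1, 0.1, 0.2, 0.3, 0.8, 0.9, 0.9, 0.9],
}
calls = fixed_window_ancestry(post, w=1)
```

Averaging over a window smooths out single loci where the wrong model happens to impute well, at the cost of blurring calls near true ancestry switches.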

228

Local ancestry at a locus is an unordered pair of (not necessarily distinct) ancestral populations.

Observations:

The local ancestry of a SNP locus is typically shared with neighboring loci.

Small window sizes may not provide enough information

Large window sizes may violate the local ancestry property for neighboring loci

When using the true values of k,l in M_kl, the accuracy of SNP genotype imputation within such a neighborhood is typically higher than when using a mis-specified model.


229





Conclusion

230

HMM imputation accuracy

Missing data rate and accuracy for imputed genotypes at different thresholds (WTCCC 58BC/HapMap CEU)

231

Window size effect

N=2,000, g=7, α=0.2, n=38,864, r=10^-8

232

Number of founders effect

CEU-JPT, N=2,000, g=7, α=0.2, n=38,864, r=10^-8

233

Comparison with other methods: % of correctly recovered SNP ancestries

N=2,000, g=7, α=0.2, n=38,864, r=10^-8

234

Untyped SNP imputation error rate in admixed individuals

N=2,000, g=7, α=0.5, n=38,864, r=10^-8

235





Conclusion

236

Conclusion-Summary and ongoing work

Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations

Code at http://dna.engr.uconn.edu/software/

Ongoing work

Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)

Extension to pedigree data

Exploiting inferred local ancestry for more accurate untyped SNP imputation and phasing of admixed individuals

Extensions to sequencing data

Inference of ancestral haplotypes from extant admixed populations

237


Questions?

238

1. L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.

2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.

3. Z. Ghahramani and M.I. Jordan. Factorial hidden Markov models. Mach. Learn., 29(2-3):245-273, 1997.

4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden Markov models of haplotype diversity. Journal of Computational Biology, 15(9):1155-1171, 2008.

5. J. Kennedy, B. Pasaniuc, and I.I. Mandoiu. GEDI: Genotype error detection and imputation using hidden Markov models of haplotype diversity. Manuscript in preparation. Software available at http://dna.engr.uconn.edu/software/gedi/.

6. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243-1260, 2005.

7. Y. Li and G.R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 79:2290, 2006.

8. J. Marchini, C. Spencer, Y.Y. Teo, and P. Donnelly. A Bayesian hierarchical mixture model for genotype calling in a multi-cohort study. In preparation, 2007.

9. B. Pasaniuc, S. Sankararaman, G. Kimmel, and E. Halperin. Inference of locus-specific ancestry in closely related populations (under review).

10. E.J. Parra, A. Marcini, J. Akey, J. Martinson, M.A. Batzer, R. Cooper, T. Forrester, D.B. Allison, R. Deka, R.E. Ferrell, et al. Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet, 63(6):1839-1851, December 1998.

11. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, pages 355-372. Wiley, 2008.

12. D. Reich and N. Patterson. Will admixture mapping work to find disease genes? Philos Trans R Soc Lond B Biol Sci, 360:1605-1607, 2005.

13. S. Sankararaman, G. Kimmel, E. Halperin, and M.I. Jordan. On the inference of ancestries in admixed populations. Genome Research, 18:668-675, 2008.

14. S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin. Estimating local ancestry in admixed populations. American Journal of Human Genetics, 82(2):290-303, 2008.

15. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629-644, 2006.

16. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90-97, 2004.

17. M.W. Smith, N. Patterson, J.A. Lautenberger, A.L. Truelove, G.J. McDonald, A. Waliszewska, B.D. Kessing, M.J. Malasky, C. Scafe, E. Le, et al. A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet, 74(5):1001-1013, May 2004.

18. A. Sundquist, E. Fratkin, C.B. Do, and S. Batzoglou. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Research, 18(4):676-682, 2008.

19. H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet, 79:1-12, 2006.

20. H. Tang, J. Peng, P. Wang, and N.J. Risch. Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology, 28:289-301, 2005.

21. C. Tian, D.A. Hinds, R. Shigeta, R. Kittles, D.G. Ballinger, and M.F. Seldin. A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet, 79:640-649, 2006.

22. http://www.hapmap.org/.

239

Acknowledgments

Work supported in part by NSF awards IIS-0546457 and DBI-0543365.

240

HELPER SLIDES-STARTS NEXT PAGE

241

Introduction- Single Nucleotide Polymorphisms


Human genome density: 1.2x10^7 out of 3x10^9 base pairs

Vast majority bi-allelic: 0/1 encoding (major/minor resp.)

SNP Genotypes are critical to Disease-Gene Mapping

One Method: Admixture Mapping




012100120

011000110

001100010


genotype

242

Helper Slide- Other Software

1) HMM-based methods:
a) SABER
b) SWITCH
c) HAPAA

2) Window-based majority vote:
a) LAMP (no recombination assumption)
b) WINPOP (a more refined model of recombination events coupled with an adaptive window size computation to achieve increased accuracy)

243

Helper Slide- Probabilities

$P_{M_{kl}}(G_i = g \mid g_{-i}) \sim P_{M_{kl}}(g[g_i \leftarrow g])$

$G_i \in \{0,1,2\}$: Random variable genotype at SNP i

$g[g_i \leftarrow g] = (g_1, \ldots, g_{i-1}, g, g_{i+1}, \ldots, g_n)$: Multilocus genotype with i set to g

$g_{-i} = (g_1, \ldots, g_{i-1}, g_{i+1}, \ldots, g_n)$: Multilocus genotype without i

$g \in \{0,1,2\}$: Genotype variable taken at SNP i

$M_{kl}$: HMM with ancestral pair k,l

Helper Slide- Emission Details

$E^{i}_{F^k_i, F^l_i}(g_i) = \sum_{h^k_i, h^l_i \in \{0,1\},\; h^k_i + h^l_i = g_i} P(h^k_i \mid F^k_i)\, P(h^l_i \mid F^l_i)$
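Written out for the three genotype values, the emission sum has one term for each homozygous genotype and two for the heterozygous one. The expansion below simply enumerates the allele pairs with the required sum; the shorthand $E^i(\cdot)$ drops the founder subscripts for readability and is a notational assumption of this note.

```latex
\begin{align*}
E^{i}(0) &= P(0 \mid F^k_i)\, P(0 \mid F^l_i) \\
E^{i}(1) &= P(0 \mid F^k_i)\, P(1 \mid F^l_i) + P(1 \mid F^k_i)\, P(0 \mid F^l_i) \\
E^{i}(2) &= P(1 \mid F^k_i)\, P(1 \mid F^l_i) \\
E^{i}(0) + E^{i}(1) + E^{i}(2)
  &= \bigl(P(0 \mid F^k_i) + P(1 \mid F^k_i)\bigr)\bigl(P(0 \mid F^l_i) + P(1 \mid F^l_i)\bigr) = 1
\end{align*}
```

The last line confirms that the genotype emissions form a proper distribution whenever the two allele emission distributions do.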

245

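Input

$P_{M_{kl}}(G_i = g \mid g_{-i})$ for every i, M_kl, and g ∈ {0,1,2}

Window half-size w

Output (Single Window method)

For every i=1..n: $\hat{a}_i \in A$

Where:

$\hat{a}_i = \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid g_{-j})$

$A = \{kl \mid 1 \le k \le l \le N\}$

$W_i = \{\max\{1, i-w\}, \ldots, \min\{n, i+w\}\}$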

246

[Animation frames: worked example of the single-window method with w=10 and ancestral populations P1: YOR, P2: CEU. The ancestry calls a_1, a_2, ..., a_i, ..., a_n are filled in from left to right for the two haplotypes k and l; the figure itself is not recoverable.]

General rule: $\hat{a}_i = \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid g_{-j})$, comparing the averages under M_11, M_12 and M_22

i=1: W_1 = {1,..,11}, average taken with factor 1/11

i=2: W_2 = {1,..,12}, average taken with factor 1/12

i=3: W_3 = {1,..,13}, average taken with factor 1/13

i=4: W_4 = {1,..,14}, average taken with factor 1/14

General i away from the ends: W_i = {i-10,..,i+10}, average taken with factor 1/21

252

Experimental Results- GEDI 1-pop Imputation

1,444 individuals trained on HAPMAP

CEU haplotype reference panel

Imputed (after masking) 1% of SNPs on chromosome 22

253

Helper Slide- Software Overview

Only require information about ancestral allele frequencies:
LAMP
WINPOP
SWITCH (HMM-based)

Only require ancestral allele frequencies & genotypes:
SABER (HMM-based)

Additionally use ancestral haplotype information:
HAPAA (HMM-based)
GEDI-ADMX (HMM-based)

254


Similar HMMs proposed by [Kimmel & Shamir 05, Rastas et al. 05, Schwartz 04]

Paths with high transition probability correspond to “founder” haplotypes

Haplotype sequence/paths computed using Viterbi and forward algorithms

[Figure: HMM annotated with transition and emission probabilities, n = #SNPs (e.g. n=5), shown next to its graphical model representation with founder states F1, F2, ..., Fi, ..., Fn and observed alleles H1, H2, ..., Hi, ..., Hn]

255


[Figure: graphical model representation with founder states F1, F2, ..., Fi, ..., Fn and observed alleles H1, H2, ..., Hi; n = #SNPs]

Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n=#SNPs, K=#founders

256


[Figure: graphical model representation with founder states F1, F2, ..., Fi, ..., Fn and observed alleles H1, H2, ..., Hi; n = #SNPs]

Training: 2-step algorithm that exploits pedigree info

Step 1: Obtain haplotypes using either:

ENT: a pedigree-aware haplotype phasing algorithm based on entropy minimization

Haplotype reference panel (e.g. HAPMAP)

Step 2: Train HMM based on inferred haplotypes, using Baum-Welch

257

Genotype Error Detection- Factorial HMM for multilocus genotype data

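[Figure: factorial HMM with two founder chains F1, F2, ..., Fi, ..., Fn, each emitting alleles H1, H2, ..., Hi, ..., Hn, which together give the genotypes G1, G2, ..., Gi, ..., Gn; n = #SNPs]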

258

Genotype Error Detection- Factorial HMM for multilocus trio data

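[Figure: factorial HMM for trio data with four founder chains F1, ..., Fi, ..., Fn, each emitting alleles H1, ..., Hi, ..., Hn, and observed rows M1 ... Mi ... Mn, F1 ... Fi ... Fn and G1 ... Gi ... Gn; n = #SNPs]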

259

(Graphical model representation)

Model training

Based on reference haplotypes using the Baum-Welch algorithm, or

Based on unphased genotypes using EM [Rastas et al. 05]

Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n=#SNPs, K=#founders

[Figure: founder states F1, F2, F3, F4, F5 with observed alleles H1, H2, H3, H4, H5]

260

F1 F2 … Fn

H1 H2 … Hn

F'1 F'2 … F'n

H'1 H'2 … H'n

G1 G2 … Gn

M_kl

Random variable for each locus i (i=1..n)

G_i = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)

261
