efficient algorithms for snp genotype data analysis using hidden markov models of haplotype...

259
Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity Justin Kennedy Dissertation Defense for the Degree of Doctorate in Philosophy Computer Science & Engineering Department University of Connecticut 1

Upload: heath

Post on 25-Feb-2016

31 views

Category:

Documents


0 download

DESCRIPTION

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity. Justin Kennedy Dissertation Defense for the Degree of Doctorate in Philosophy Computer Science & Engineering Department University of Connecticut. Outline. Introduction - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Efficient Algorithms for SNP Genotype Data Analysis using

HiddenMarkov Models of Haplotype

Diversity

Justin KennedyDissertation Defense for the Degree of Doctorate in Philosophy

Computer Science & Engineering DepartmentUniversity of Connecticut

1

Page 2: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion

2

Page 3: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction-Single Nucleotide Polymorphisms

Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)

High density in the human genome: 1.3x107 out of 3109 base pairs

Vast majority bi-allelic 0/1 encoding (major/minor resp.)

Haplotype: description of SNP alleles on a chromosome0/1 vector: 0 for major allele, 1 for minor

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcGgtatacacgggTctata …

3

Page 4: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-SNP Genotypes

Diploid: two haplotypes for each chromosome One inherited from mother and one from father

Multilocus Genotype: description of alleles on both chromosomes

0/1/2 vector: 0 (2) - both chromosomes contain the major (minor) allele; 1 - the chromosomes contain different alleles

SNP Genotypes are critical to Disease-Gene Mapping

011000110001100010012100120

+two haplotypes per individual

genotype

4

Page 5: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Why SNP Genotypes?

SNPs are the genetic marker of choice for genome wide association studies (GWASs)

GWAS: Method for discovering disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association.

Ongoing GWASs generate a deluge of genotype data Genetic Association Information Network (GAIN): 6 studies totaling

19,000 individuals typed at 500,000 to 940,000 SNP loci Wellcome Trust Case-Control Consortium (WTCCC): 7 studies

totaling 17,000 individuals typed at 500,000 SNPs WTCCC2: hundreds of thousands of individuals covering over a

million SNPs!5

Page 6: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction-Computational Challenges to Disease Gene Mapping

Genotype error detection: Genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes.

Local ancestry Inference: Accurate estimates of local ancestry surrounding disease-associated loci are a critical step in admixture mapping.

Accurate SNP Genotyping from new sequencing

technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to random nature of shotgun sequencing.

6

Page 7: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion 7

Page 8: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Haplotype structure in panmictic populations

8

Page 9: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HMM of haplotype diversity

Similar models proposed in [Schwartz 04, Rastas et al. 08, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]

Captures Linkage Disequilibrium (LD)

k = 4(# founders)

n = 5(# SNPs)

9

Page 10: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Graphical model representation

Random variables for each locus i (i=1..n) Fi = founder haplotype at locus i; values between 1 and k Hi = observed allele at locus i; values: 0 (major) or 1 (minor)

Model training, based on Baum-Welch algorithm, using: Reference haplotypes from population panel (e.g. Hapmap), or Haplotypes from phased genotype using ENT software

Given haplotype h, P(H=h|M) can be computed in O(nk2) using a forward algorithm.

F2 Fn…

H1 H2 Hn

10

F1

Page 11: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

F1 F2 Fn…

H1 H2 Hn

F'1 F'2 F'n…

H'1 H'2 H'n

G1 G2 Gn

Factorial HMM for genotype data

Random variable for each locus i (i=1..n) Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.) Given multilocus genotype g, P(g|M) can be computed in O(nk4) using

a forward algorithm. 11

Page 12: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HMM Based Genotype Imputation

Probability of observing genotype at locus i given the known multilocus genotype with missing data at i:

Gi is imputed as:

x

12

)|],...,,,,...,([)],,...,,,...,[|( 111111 MGGxGGPMGGGGxGP niiniii

)|],...,,,,...,([maxarg 111}2,1,0{ MGGxGGP niix

Page 13: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

F’i …

H’i

Forward-backward computation

)()|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

FGMgP

iii iiiii

13

Page 14: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

F’i …

H’i

Forward-backward computation

14

)()|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

FGMgP

iii iiiii

Page 15: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

F’i …

H’i

Forward-backward computation

15

)()|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

FGMgP

iii iiiii

Page 16: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

F’i …

H’i

Forward-backward computation

16

)()|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

FGMgP

iii iiiii

Page 17: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Runtime Direct recurrences for computing forward probabilities

O(nk4) :

Runtime reduced to O(nk3) by reusing common terms:

where

17

)'()(1', iiFF FPFP

ii

k

Fi

iFFiiii

iFF

k

F

iFF

i

iiii

i

iiGFFPFFP

1'1

1',11

1',

1',

1

1111

1

)()'|'()|(

k

F

iFFii

iFF

i

iiiixFFP

1'',1',

1

11)'|'(

k

Fi

iFFii

iFF

iFF

i

iiiiiiGFFPx

1'1

1',1

1',',

1

1111)()'|'(

Page 18: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Speed-up: PopTree Trie

18

Page 19: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Motivation Likelihood Sensitivity Approach to Error Detection

Hidden Markov Model of Haplotype Diversity

Efficiently Computable Likelihood functions

Experimental Results

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion19

Page 20: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Errors- Motivation

A real problem despite advances in genotyping technology [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP

genotypes typed multiple times 1% errors decrease power by 10-50% for linkage, and by 5-20% for association

Error types Easily Detectable errors

Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004] For pedigree data some errors detected as Mendelian Inconsistencies (MIs)

E.g. Only ~30% detectable as MIs for trios [Gordon et al. 1999] Undetected errors

Methods for handling undetected errors: Improved genotype calling algorithms Improved modeling in analysis methods Separate error detection step

Detected errors can be retyped, imputed, or ignored in downstream analysis20

Page 21: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

0 1 1 1 0 0 H10 0 0 1 0 1 H3

0 1 1 1 0 0 H1

0 1 0 1 0 1 H2

0 0 0 1 0 1 H3

0 1 1 1 0 0 H4

)()()()( MAX)( 4321 HpHpHpHpTL

21

Page 22: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

? 0 1 0 1 0 1 H’ 1 0 0 0 1 0 0 H’ 3

0 1 0 1 0 1 H’1

0 1 1 1 0 0 H’2

0 0 0 1 0 0 H’ 3

0 1 1 1 0 1 H’ 4

Likelihood of best phasing for modified trio T’ 22

)()()()( MAX)( 4321 HpHpHpHpTL

)'()'()'()'( MAX)(' 4321 HpHpHpHpTL

Page 23: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Large change in likelihood suggests likely error Flag genotype as an error if L(T’)/L(T) > R, where R is the

detection threshold (e.g., R=104)

?

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

23

Page 24: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection

[Becker et al. 06] Implementation in FAMHAP Software

Window-based algorithm For each window including the SNP

under test, generate list of H most frequent haplotypes (default H=50)

Find most likely trio phasings by pruned search over the H4 quadruples of frequent haplotypes

Flag genotype as an error if L(T’)/L(T) > R for at least one window

Mother …201012 1 02210...Father …201202 2 10211...Child …000120 2 21021...

24

Page 25: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Limitations of FAMHAP

Unbounded list of haplotypes (H=4n) is hard to compute

Truncating H may lead to sub-optimal phasings and inaccurate L(T) values

False positives caused by nearby errors (due to the use of multiple short windows)

Our approach: HMM of haplotype frequencies all haplotypes

represented + no need for short windows Alternate likelihood functions scalable runtime

25

Page 26: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Trio-Based HMM of haplotype diversity

26

F2 Fn…

H1 H2 Hn

F‘1

H'1 H'2 H'n

GM1 GM2 GMn

F1 F2 Fn…

H1 H2 Hn

F‘2 F'n… F’1 F’2 F’n…

H’1 H’2 H’n

GF1 GF2 GFn

GC1 GC2 GCn

F1

Page 27: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Alternate Likelihood Functions

Viterbi probability (ViterbiProb): Maximum prob. of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio.

Probability of Viterbi Haplotypes (ViterbiHaps): Obtain the path of the 4 Viterbi haplotypes, then then take product of these individual haplotype probabilities using forward (again).

Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths.

27

Page 28: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Speed Ups from reuse of common terms

Straight-forward approach run time: For a fixed trio, ViterbiProb/TotalProb paths can be found using a 4-path version of

Viterbi’s/Forward algorithm in time For ViterbiHaps, additional traceback to compute probabilities:

K3 speed-up by reuse of common terms: per trio

Likelihoods of all 3n modified trios computed using forward-backward algorithm,

ViterbiProb/TotalProb for m trios: ViterbiHaps:

)( 5nkO

))(( 25 knnkmO

)( 5mnkO

28

)( 8nkO)( 28 knnkO

Page 29: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Comparison of Likelihood

Functions

-0.005 0.005 0.0150

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

VitHaps-PVitProb-PTotalProb-PVitHaps-CVitProb-CTotalProb-C

FP rate

Sens

itivi

ty

Sensitivity=TP/(TP+TN)

False Positive rate = 1 - TN/(FP+TN)

35 SNPs551 trios[Becker 06]1% err. rate

29

Page 30: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-“Combined” Detection Method

Compute 4 likelihood ratios

Trio Mother-child duo Father-child duo Child (unrelated)

Flag as error if all ratios are above detection threshold

30

Page 31: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Comparison with FAMHAP

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

Sensitivity

FP rate

#FP=#FN line

TotalProb-TRIO

TotalProb-COMBINEDFAMHAP

Children

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

SensitivityParents

FP Rate

35 SNPs551 trios[Becker 06]1% err. rate

31

Page 32: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations Motivation Factorial HMM of genotype data Algorithms for genotype imputation and ancestry inference Experimental results

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion 32

Page 33: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Population admixture

http://www.garlandscience.co.uk/textbooks/0815341857.asp?type=resources 33

Page 34: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Motivation: Admixture mapping

Patterson et al, AJHG 74:979-1000, 2004 34

Page 35: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Local ancestry inference problem

rs11095710 T T rs11117179 C T rs11800791 G G rs11578310 G Grs1187611 G Grs11804808 C C rs17471518 A G...

Given: Reference haplotypes for all ancestral populations to be

studied Whole-genome SNP genotype data for extant individual

Find: Allele ancestries at each SNP locus

Reference haplotypes

SNP genotypes

rs11095710 P1 P1rs11117179 P1 P1rs11800791 P1 P1rs11578310 P1 P2rs1187611 P1 P2rs11804808 P1 P2rs17471518 P1 P2...

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

Inferred local ancestry

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

35

Page 36: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Previous work Two main classes of methods for SNP

Ancestry Inference HMM-based (exploit LD): SABER [Tang et al 06],

SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …

Window-based (unlinked SNP Data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]

Limitations Poor accuracy when ancestral populations are

closely related (e.g. Japanese and Chinese) Methods based on unlinked SNPs outperform

methods that model LD!36

Page 37: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

F2 Fn…

H1 H2 Hn

F'1 F'2 F'n…

H'1 H'2 H'n

G1 G2 Gn

Factorial HMM for genotype data in a window with known local ancestry

klM

37

F1

Page 38: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fixed-window version: pick ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus

Observations: The local ancestry of a SNP locus is typically shared with

neighboring loci. Small Window sizes may not provide enough information Large Window sizes may violate local ancestry property for

neighboring loci

Imputation-based ancestry inference

38

11M 12M 22M

Page 39: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.2n=38,864

r=10-8

Window size effect

39

Page 40: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multi-window version: Weighted voting over window sizes

between 200-3000, with window weights proportional to average posterior probabilities

Imputation-based ancestry inference

40

Page 41: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.2n=38,864

r=10-8

Comparison with other methods% of correctly recovered SNP ancestries

41

Page 42: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.5n=38,864

r=10-8

Untyped SNP imputation error rate in admixed individuals

42

Page 43: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Imputation- Accuracy with number

founders/runtime5835 SNPs2502 unrelated (CEU)[IMAGE]9% imputed (535 SNPs)

43

3 5 7 9 11 13 1585%

87%

89%

91%

93%

95%

97%

99%

-

500

1,000

1,500

2,000

2,500

3,000

3,500

4,000

Impu

tatio

n Er

ror R

ate

CPU

seco

nds

# Founders

Page 44: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Number of founders effect on Ancestry inference

CEU-JPTN=2,000

g=7=0.2

n=38,864 r=10-8

44

Page 45: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data Motivation Single SNP Calling Algorithms HF-HMM Overview Multilocus HMM Calling Algorithm Experimental Results

Conclusion 45

Page 46: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Next Generation Sequencing (NGS)

By several orders of magnitude, NGS delivers higher throughput of sequencing reads compared to older technologies (e.g. Sanger sequencing)

46

Roche/454 FLX Titanium~1M reads400bp avg. 400-600Mb / run (10h)

ABI SOLiD 3 plus~500M reads/pairs35-50bp25-60Gb / run (3.5-14 days)

Illumina Genome Analyzer IIx~100-300M reads/pairs35-100bp4.5-33 Gb / run (2-10 days)

Page 47: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-NGS Applications and Challenges

NGS is enabling many applications, including personal genomics ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07] ~$1 million for sequencing James Watson genome [Wheeler et al 08]

using 454 technology. ~$50K human sequencing now available Thousands more individual genomes to be sequenced as part of

1000 Genomes Project Challenges:

Sequencing requires accurate determination of genetic variation (e.g. SNPs)

Accuracy is limited by coverage depth due to random nature of shotgun sequencing

For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].

[Wheeler et al 08] use hypothesis testing based on binomial distribution

47

Page 48: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Do Heuristic Inputs Help?

[Wendl&Wilson 08] predict that 21x coverage is required for sequencing of samples based on the assumption that “neglects any heuristic inputs”

We propose methods incorporating two additional sources of information:

Quality scores reflecting uncertainty in sequencing data

Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap

48

Page 49: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Pipeline for Single Genotype Calling

49

Page 50: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Single SNP Genotyping- Basic Notations

Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)

SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous

Read set ri describes the mapped reads for each SNP i

Inferred genotypesMapped reads with allele 0

Mapped reads with allele 1012100120

Sequencing errors50

Page 51: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Applying Bayes’ formula:

Where: is the conditional probability for the read set at

locus i

are genotype frequencies at inferred from a representative panel

Single SNP Genotype Calling

}2,1,0{

)|r()()|r()()r|(

x iii

iiiii xGPxGP

GPGPGP

)( iGP

51

)|r( ii GP

Page 52: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Pipeline for Multilocus Genotyping

Reference haplotypes

52

Page 53: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HF-HMM

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

53

Page 54: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HF-HMM Training

Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father

Use haplotype reference panel (e.g. HAPMAP) for training Haplotypes

Conditional probabilities for read sets are given by the formulas derived for the single SNP case:

54

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r(

Page 55: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Remark: maxgP((G*1,G*2,…,G*n)| r) is hard to approximate within unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard

Multilocus Genotyping Problem

GIVEN: • Shotgun read sets r=(r1, r2, … , rn)• Quality scores• Trained HMMs representing LD in populations of origin for

mother/fatherFIND:

• Multilocus genotype (G*1,G*2,…,G*n) with maximum posterior probability, i.e., (G*1,G*2,…,G*n) argmaxG1..Gn

P(G1…Gn | r)

)( 1 nO

55

Page 56: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HMM-Posterior Decoding Algorithm

1. For each i = 1..n, compute

2. Return *)*,...,( 1 nGG)Mr,|(maxarg* iGi GPG

i

56

Page 57: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Forward-Backward Computation of Posterior Probabilities

Fi …

Hi

Gi

…R1,1Ri,1

F’i …

H’i

R1,c …Ri,c …1

i n

57Rn,cRn,1

)()|r()Mr,|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

Fiii GGPGPiii iiiii

Page 58: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Forward-Backward Computation of Posterior Probabilities

Fi …

Hi

…R1,1Ri,1

F’i …

H’i

R1,c …Ri,c …1

i n

58

)()|r()Mr,|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

Fiii GGPGPiii iiiii

Rn,cRn,1

Gi

Page 59: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

…R1,1Ri,1

F’i …

H’i

R1,c …Ri,c …Rn,11

i n

Forward-Backward Computation of Posterior Probabilities

59Rn,c

)()|r()Mr,|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

Fiii GGPGPiii iiiii

Page 60: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

…Ri,1

F’i …

H’i

R1,c …Ri,c … Rn,c1

i

Forward-Backward Computation of Posterior Probabilities

60Rn,1R1,1

)()|r()Mr,|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

Fiii GGPGPiii iiiii

Page 61: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Fi …

Hi

Gi

…R1,1Ri,1

F’i …

H’i

R1,c …Ri,c …1

i n

Forward-Backward Computation of Posterior Probabilities

61Rn,cRn,1

)()|r()Mr,|( '' ''1 ,1 ,, i

iFF

k

Fi

FFi

FF

k

Fiii GGPGPiii iiiii

Page 62: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-Runtime

Direct implementation gives O(m+nk4) time: m = number of reads n = number of SNPs k = number of founder haplotypes in HMMs

Runtime reduced to O(m+nk3) by reusing common terms:

where

k

F

iFFii

iFF

iFF

iiiiiii

FFP1

1,

'1

'1,,

'1

'11

'11

'1

)|(

k

F

iFFii

iFF

iiiii

FFP1

,1,'1

'1

' )|(

}1,0{

'',

'' )|()|()|()(

iiii

HHiiiiiiiii

iFF

HHGRPFHPFHPG62

Page 63: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Experimental Results- Setup

Subset of James Watson’s 454 reads 74.4M of 106.5M reads

265 bp/read avg coverage: 5.64X

Quality scores included Reads mapped on human genome build 36.3

using the nucmer tool of the MUMmer package [Kurtz et al 04]

Estimated mapping error rates: FP rate: 0.37% FN rate: 21.16%

Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)

63

Page 64: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (Heterozygous Genotypes)

80

82

84

86

88

90

92

94

96

98

100

0 10 20 30 40 50 60 70 80 90 100

% c

orre

ctly

cal

led

% uncalled

1SNP-Posterior Binomial HMM-Posterior

64

Page 65: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (All Genotypes)

93

94

95

96

97

98

99

100

0 20 40 60 80 100

% c

orre

ctly

cal

led

% uncalled

1SNP-Posterior Binomial HMM-Posterior

65

Page 66: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy at Varying Coverages (All Genotypes)

70

75

80

85

90

95

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

Posterior 1/16 Posterior 1/8 Posterior 1/4 Posterior 1/2 Posterior 1/1Binomial 1/16 Binomial 1/8 Binomial 1/4 Binomial 1/2 Binomial 1/1

66

Page 67: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Hidden Markov Models of Haplotype Diversity

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Imputation-based Local Ancestry Inference in Admixed Populations

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion 67

Page 68: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion

68

Genotype Error Detection Contributions:

Proposed efficient methods for error detection in trio genotype data based on an HMM of haplotype diversity

Can exploit available pedigree info Yield improved detection accuracy compared to FAMHAP Runtime grows linearly in #SNPs and #individuals

Papers: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection

using Hidden Markov Models of Haplotype diversity. Journal of Computational Biology (to Appear).

J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007

Software: GEDI (Genotype Error Detection and Imputation): J. Kennedy, I.I. Mandoiu and B. Pasaniuc. GEDI: Scalable Algorithms for

Genotype Error Detection and Imputation. ARXIV Report, 2009 Best poster award: J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error

Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.

Page 69: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion Imputation-based local ancestry inference in admixed

populations Contributions:

Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations

Future work: Evaluating accuracy under more realistic admixture scenarios (multiple

ancestral populations/gene flow/drift in ancestral populations) Extension to pedigree data Extensions to sequencing data Exploiting inferred local ancestry for phasing of admixed individuals Inference of ancestral haplotypes from extant admixed populations

Papers: B. Pasaniuc and J. Kennedy and I.I. Mandoiu, Imputation-based local

ancestry inference in admixed populations, Proc. 5th International Symposium on Bioinformatics Research and Applications/2nd Workshop on Computational Issues in Genetic Epidemiology, pp. 221-233, 2009

Software: GEDI-ADMX Unix and Windows version (geneAdmixViewer on Windows) 69

Page 70: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion Genotyping from low coverage sequencing reads

Contributions: Exploiting “heuristic inputs” such as quality scores and population allele

frequency and LD information yields significant improvements in genotyping calling accuracy from low-coverage sequencing data

LD information extracted from a reference panel gives highest benefit Relatively small gain from incorporating quality scores may be due in part to the

poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08] Although our evaluation is on 454 reads, the methods are well-suited for short

read technologies Future Work: Population-based Genotyping from Low-Coverage

Sequencing Data Extending the single individual genotyping methods to population sequencing

data (removing the need for reference panels) Use Same HF-HMM as before, only training model based off of EM algorithm

on population-level data provided. Papers:

J. Duitama, S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single individual genotyping from low-coverage sequencing data (In Preparation)

Presentation: J.Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.

Software: Gene-Seq70

Page 71: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Questions?

71

Page 72: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Acknowledgments

This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation

72

Page 73: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Click again for helper slides

73

Page 74: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HAPMAP: The International HAPMAP Project is an organization

whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.

HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.

The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.

74

Page 75: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Baum-Welch overview The algorithm has two steps:

Calculating the forward probability for each HMM state;

Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.

75

Page 76: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

GEDI Speed-up: PopTree Trie Due to limited genotype variation across individuals of the

same population, additional re-use of forward and backward probability computations corresponding to genotype prefixes (respectively suffixes) shared by multiple genotypes is possible.

GEDI builds PopTree, which is a trie (prefix tree), from the given multilocus genotypes and then computes probabilities by performing a preorder traversal of the trie.

Specifically, the PopTree data structure for unrelated individuals in a population consists of:

Up to n levels, Each node has up to 3 child edges one for each possible

genotype value (0, 1, 2).

76

Page 77: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

GEDI Speed-up: PopTree Speed-up

3 4 5 6 8 10 15 20 25 50 100 -

1,000

2,000

3,000

4,000

5,000

6,000

7,000

7 Pop-Tree13 Pop-Tree7 Slow13 Slow

Time (Seconds)

# Flank-ing

77

Page 78: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Imputation- Accuracy with varying parameters

5835 SNPs2502 unrelated[IMAGE]9% imputed (535 SNPs)

3060

90120

220320

420520

5.00%7.00%9.00%11.00%13.00%15.00%17.00%

3

5

7

9

11

13

15

3 5 7 9

11 13 15

#Founders

# Training Haplo-

types

% Error Rate

78

Page 79: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Experimental Results (Setup)

Real dataset [Becker et al. 2006] 35 SNP loci covering a region of 91kb 551 trios

Synthetic datasets 35 SNPs, 551 trios Preserved missing data pattern of real

dataset Haplotypes assigned to trios based on

frequencies inferred from real dataset 1% error rate using random allele insertion

model79

Page 80: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Error Detection Accuracy on Unrelated Genotype Data

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

Sens

itivi

ty

FP rate

#FP=#FN line

Len=10Kb (U)

Len=100Kb (U)

Len=1Mb (U)

Len=10Mb (U)

551 unrelated individuals Recombination & mutation rates of 10-8 per generation

per bp 35 SNPs within a region of 10kb-10Mb

Genotype Error Detection- Effect of Distance between SNPs

80

Page 81: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- TrioProb-Combined Results on Real Dataset

[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3

26 SNP genotypes in 23 trios were identified as true errors 41*3-26=97 resequenced SNP genotypes agree with

original calls (or are unknown)

  Total Signals True Positives False Positives Unknown

FP Rate 1%.5% .1% 1%.5% .1% 1%.5% .1% 1%.5% .1%

Parents 218 127 69 9 9 8 1 0 0 208 118 91

Children 104 74 24 11 11 11 3 3 2 90 60 11

Total 322 201 93 20 20 19 4 3 2 298 178 72

81

Page 82: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Error Model Comparison

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

random allele (P)random geno (P)hetero-to-homo (P)homo-to-hetero (P)random allele (C)random geno (C)hetero-to-homo (C)homo-to-hetero (C)

82

Page 83: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Distribution of LLR for Total Trio Prob.

Parents-TRIOS

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.

7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.

4

5.67

5.94

NO_ERR ERR

Children-TRIOS

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.

7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.

4

5.67

5.94

NO_ERR ERR

Same-locus errors in parents

35 SNPs551 trios[Becker 06]1% err. rate

83

Page 84: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-LLR for Combined Method

Parents-COMBINED

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.4

5.67

5.94

NO_ERR ERR

Children-COMBINED

1

10

100

1000

10000

100000

1000000

0

0.28

0.56

0.84

1.12 1.4

1.68

1.96

2.24

2.52 2.8

3.08

3.36

3.64

3.92 4.2

4.48

4.76

5.04

5.32 5.6

5.88

NO_ERR ERR

35 SNPs551 trios[Becker 06]1% err. rate

84

Page 85: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Effect of Population Size

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

n=551 (P)n=129 (P)n=30 (P)n=551 (C)n=129 (C)n=30 (C)

85

Page 86: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Imputation-Effect of flanking size

3 4 5 6 8 10 15 20 25 50 1000.00%

2.00%

4.00%

6.00%

8.00%

10.00%

12.00%

-

10,000

20,000

30,000

40,000

50,000

7 Founders- Error Rate13 Founders- Error Rate7 Founders- Time13 Founders- Time

Error Rate

Time (Seconds)

# Flanking

86

Page 87: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Imputation-Effect of pedigree data/haplotypes

30 60 90 120 220 320 420 5205.00%

6.00%

7.00%

8.00%

9.00%

10.00%

11.00%

12.00%

13.00%

14.00%

Trios 7-founders

Unrelated 7-founders

Unrelated 13-founders

Trios 13-founders

Error Rate

#Training Haplotypes

87

Page 88: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Single Nucleotide Polymorphisms

Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)

Human genome density: 1.2x107 out of 3109

base pairs Vast majority bi-allelic 0/1 encoding (major/minor

resp.)

SNP Genotypes are critical to Disease-Gene Mapping One Method: Admixture Mapping

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcGgtatacacgggTctata …

012100120

011000110

001100010

+two haplotypes per individual

genotype

88

Page 89: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide-Software Overview Only Require information about ancestral allele

frequencies: LAMP (No recombination assumption) WINPOP (a more refined model of recombination

events coupled with an adaptive window size computation to achieve increased accuracy)

SWITCH (HMM-Based) Only require ancestral allele frequencies &

genotypes SABER (HMM-Based)

Additionally use ancestral haplotype information:

HAPAA (HMM-Based) GEDI-ADMX (HMM-Based) 89

Page 90: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide- Probabilities

)[(~)|( g]gg i gPgGPklkl MiiM

:,...,,,...,()[ n1i1-i1i ggg,ggg]g gg

:),...,,,...,( n1i1-i1 gggg ig:}2,1,0{g:}2,1,0{iG

:klM

Random variable genotype at SNP i

Multilocus genotype with I set to

Multilocus genotype without i

Genotype variable taken at SNP i

HMM with ancestral pair k,l

Page 91: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide- Emission Details

ig

igli

ki

li

ki

li

ki

hhhh

li

li

ki

kiFF

i FhPFhP}1,0{,

)|()|()(E

91

Page 92: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Input For every Window half-size w

Output (Single Window method) For every i=1..n:

Where:

)|( iiM gGPkl g {0,1,2}g,, klMi

i

klWj

iiMi

Akli gGPW

a )|(||

1maxargˆ 1ig

}1|{ nlkklA

}},min{},...,1{max{ winwiWi

Window-based local ancestry inference

Aai ˆ

92

Page 93: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

???????????????????????????????????????????????????????????????????????????????????????

???????????????????????????????????????????????????????????????????????????????????????

i

klWj

iiMi

Akli gGPW

a )|(||

1maxargˆ 1ig

Window-based local ancestry inference

93

Page 94: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

??????????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

??????????????????????????????????????????????????????????????????????????????????????

}11..1{

11 )(111maxargˆ

jMAkl GPa

kl 1g 10w }11,..,1{1 W

}11..1{}11..1{}11..1{

221211 111

111

111

jM

jM

jM PPP

1

1

Window-based local ancestry inference

94

Page 95: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }12,..,1{2 W

1

1

}12..1{

2222 )|(121maxargˆ

jMAkl gGPa

kl 2g

1

1

}12..1{}12..1{}12..1{

221211 121

121

121

jM

jM

jM PPP

????????????????????????????????????????????????????????????????????????????????????

????????????????????????????????????????????????????????????????????????????????????

Window-based local ancestry inference

95

Page 96: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

??????????????????????????????????????????????????????????????????????????????????

??????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }13,..,1{3 W

1

1

}13..1{

3333 )|(131maxargˆ

jMAkl gGPa

kl 3g

1

1

}13..1{}13..1{}13..1{

221211 131

131

131

jM

jM

jM PPP

1

1

Window-based local ancestry inference

96

Page 97: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

????????????????????????????????????????????????????????????????????????????????

????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }14,..,1{2 W

1

1

}14..1{

4444 )|(141maxargˆ

jMAkl gGPa

kl 4g

1

1

}14..1{}14..1{}14..1{

221112 141

141

141

jM

jM

jM PPP

1

1

1

2

Window-based local ancestry inference

97

Page 98: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

1112222222222222221111111111111122222222222222222222222221111111111111222222

1111111112222222211111111111111112222 ???????????????????????????????????????????

1112222222222222221111111111111122222 ???????????????????????????????????????????

1111111112222222211111111111111112222222222222222111111111111111111111111111

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }10,..,10{ iiWi

iii Wj

MWj

MWj

M PPP221112 21

1211

211

1

2

i

klWj

iiMAkli gGPa )|(211maxargˆ 1ig

Window-based local ancestry inference

98

Page 99: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results- GEDI 1-pop Imputation

1,444 individuals trained on HAPMAP

CEU haplotype reference panel

Imputed (after masking) 1% of SNPs on chromosome 22

99

Page 100: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Binomial Distribution

n: # of successive trials (reads)p: the probability of a success (correct map)1-p: =q , theprobability of a failure (incorrect map)

To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

100

Page 101: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

To determine quality scores, Phred first calculates several parameters related to peak shape and peak resolution at each base. Phred then uses these parameters to look up a corresponding quality score in huge lookup tables. These lookup tables were generated from sequence traces where the correct sequence was known, and are hard coded in Phred; different lookup tables are used for different sequencing chemistries and machines

Phred Score

101

Page 102: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conditional Probability for Heterozygous Genotypes

i

i

c

riririi GP

21

21)1(

21)1|r(

r)()(

102

Page 103: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Baum-Welch overview The algorithm has two steps:

Calculating the forward probability for each HMM state;

Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.

103

Page 104: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Model Training- Details

Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r(

)(1)(

)()(

)()(

)(1)(, 1

221

2)|( ir

irir

iriir

irir

iri

iijigggGrRP

104

Page 105: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Implementation Details

iii

iiii

ghhhh

iiiiii

fffhPfhPg

'

''

}1,0{,

'',

)|()|()(

Forward recurrences:

)()( '11

1, ' fPfP

ii ff

K

fi

iffii

K

fii

iff

iff

iii

iiiii

gffPffP1

11

,11

11

,,1

'11'

1

'11

' )()|()|(

Backward recurrences are similar 105

Page 106: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Allele coverage for heterozygous SNPs (Watson 454 @ 5.85x avg. coverage)

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

Reference allele coverage

Varia

nt a

llele

cov

erag

e

Page 107: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Allele coverage for heterozygous SNPs (Watson 454 @ 2.93x avg. coverage)

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

Reference allele coverage

Varia

nt a

llele

cov

erag

e

Page 108: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Allele coverage for heterozygous SNPs (Watson 454 @ 1.46x avg. coverage)

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

Reference allele coverage

Varia

nt a

llele

cov

erag

e

Page 109: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Single SNP Genotyping- Incorporating Base Call Uncertainty

Let ri denote the set of mapped reads covering SNP locus i and ci=| ri |

For a read r in ri , r(i) denotes the allele observed at locus i If qr(i) is the phred quality score of r(i), the probability that r(i)

is incorrect is given by The probability of observing read set ri conditional on

having genotype gi is then given by:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

10/)(

)(10 irqir

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r(

111

Page 110: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Subset of James Watson’s 454 reads 74.4 million reads with quality scores (of 106.5 million

reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/

Average read length ~265 bpRead length distribution

0

500000

1000000

1500000

2000000

0 100 200 300 400 500 600 700 800 900 1000 1100 1200 112

Page 111: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]

Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)

Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded

Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:

FP rate: 0.37% FN rate: 21.16%

113

Page 112: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Average coverage by mapped reads of Hapmap SNPs was 5.64x

Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

SNP coverage by mapped reads

0

100000

200000

300000

400000

500000

600000

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

114

Page 113: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

CEU genotypes from latest Hapmap release (23a) were dowloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch

Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% in CEU same-strand allele frequency

115

Page 114: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (Homozygous Genotypes)

97

97.5

98

98.5

99

99.5

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

116

Page 115: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Gene-Seq Algorithms Posterior Decoding (see presentation) Greedy:

Markov Approximation:

Composite 2-SNP Viterbi Posterior Decoding, version 2

117

Page 116: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

118

Page 117: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction Helper Linkage Analysis: Study aimed at establishing linkage between

genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.

Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.

Relative Risk is the risk of an event (or of developing a disease) relative to exposure.

Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.

For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.

Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits

119

Page 118: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Association analysis

Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies

Introduction-Disease Gene Mapping

Linkage analysis

Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)

Low power to detect genes with small relative risk in complex diseases [RischMerikangas’96]

Cases Controls

120

Page 119: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Motivation

Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)

Errors as low as .1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]

1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

121

Page 120: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao

et al. 07] Explicit modeling in analysis methods

[Cheng 07, Hao & Wang 04, Liu et al. 07] Computationally complex

Separate error detection step Detected errors can be retyped, imputed, or ignored in

downstream analyses Common approach in pedigree genotype data analysis

[Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]

Genotype Error Detection-Motivation

122

Page 121: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Complexity of Computing Maximum Phasing Probability

• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(f½-) unless ZPP=NP, where f is the number of founders• For trios, hard to approx. within O(f1/4 -)• Reductions from the clique problem

123

Page 122: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

NGS Applications

Besides reducing costs of de novo genome sequencing, NGS has found many more apps:

Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …

NGS is enabling personal genomics James Watson genome [Wheeler et al 08] sequenced

using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]

Thousands more individual genomes to be sequenced as part of 1000 Genomes Project 124

Page 123: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Challenges in Medical Applications of Sequencing

Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)

Requires accurate determination of both alleles at variable loci

Accuracy is limited by coverage depth due to random nature of shotgun sequencing

For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].

[Wheeler et al 08] use hypothesis testing based on binomial distribution

To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

125

Page 124: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Prior Methods for Calling SNP Genotypes from Read Data

Prior methods are all based on allele coverage [Levy et al 07] require that each allele be covered by at

least 2 reads in order to be called [Wheeler et al 08] use hypothesis testing based on the

binomial distribution To call a heterozygous genotype must have each allele

covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k

126

Page 125: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads S. Dinakar1, Y. Hernandez2, J. Kennedy1, I. Mandoiu1, and Y. Wu11CSE Department, University of Connecticut 2Department of Computer Science, Hunter College

Motivation• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)

– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing

– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]

• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples – Based on idealized theory that “neglects heuristic inputs”

• We propose genotype calling methods that exploit following heuristic inputs– Quality scores reflecting uncertainty in sequencing data– Allele frequency and linkage disequilibrium info extracted from reference

panels such as Hapmap• Do heuristic inputs help?

– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads

• Applying Bayes’ formula:

– Where are allele frequencies inferred from a representative panel

Single SNP Genotype Calling• Let ri denote the arbitrarily ordered set of mapped reads

covering SNP locus i and ci=| ri |– For a read r in ri , r(i) denotes the allele observed at locus i– If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect

is given by

• The probability of observing read set ri conditional on having genotype Gi:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

10/)(

)(10 irqir

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP ic

ii GP

21)1|r(

}2,1,0{)|r()(

)|r()()r|(g iiii

iiiiiii gGPgGP

gGPgPgGP

)( ii gGP

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]

Multilocus Model

• Initial founder probabilities P(f1), P(f’1), transition probabilities P(f i+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

• P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

Model Training

)(1)(

)()(

)()(

)(1)(, 1

221

2)|( ir

irir

iriir

irir

iri

iijigggGrRP

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP ic

ii GP

21)1|r(

• Joint probabilities can be computed using a forward-backward algorithm:

• Direct implementation gives O(m+nK4) time, where – m = number of reads– n = number of SNPs– K = number of founder haplotypes in HMMs

• Runtime reduced to O(m+nK3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

Posterior Decoding Algorithm1. For each i = 1..n, compute2. Return *)*,...,(* 1 nggg

)r,(maxarg)r|(maxarg* igigi gPgPgii

• Read data– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et

al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/

– Average read length ~265 bp

• Reference population data– CEU genotypes from latest Hapmap release (23a) from

http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMM models using the Baum-Welch algorithm

• Genotype data– Duplicate Affymetrix 500k SNP genotypes downloaded from

ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% in CEU same-strand allele frequency

Experimental Setup

• Reads mapped on human genome build 36.3 using the nucmertool of the MUMmer package [Kurtz et al 04]– Default nucmer parameters:

• MUM size 20, min cluster size 65, max gap between adjacent matches 90– Additional filtering

• At least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

– Reads meeting above conditions at multiple genome positions were discarded

• These reads are coming from genomic repeats, and is difficult to accurately map them

• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:– Estimated mapping FP rate is 0.37%

Read Mapping Procedure

Hapmap SNP coverage by mapped reads

0

100000

200000

300000

400000

500000

600000

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

• Average coverage by mapped reads of Hapmap SNPs was 5.64x– Lower than [Wheeler et al 08] since we start with a subset of the reads

and use more stringent mapping constraints

Read Mapping Results

Homozygous Genotypes

97

97.5

98

98.5

99

99.5

100

0 10 20 30 40 50

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

Genotyping Accuracy at 5.64x Read Coverage

Heterozygous Genotypes

80

82

84

86

88

90

92

94

96

98

100

0 10 20 30 40 50

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

Conclusions & Ongoing Work• Exploiting “heuristic inputs” such as quality scores, population allele

frequency, and LD information yields significant improvements in genotyping calling accuracy from low-coverage sequencing data– Improvement depends on the coverage depth (higher at lower coverage)

• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads

– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]

– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads

• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

ACKNOWLEDGEMENTSThis work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.

Genotyping Accuracy at Lower Read Coverage

70

75

80

85

90

95

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

HMM-Posterior 1/16 HMM-Posterior 1/8 HMM-Posterior 1/4 HMM-Posterior 1/2 HMM-Posterior 1/1Binomial0.01 1/16 Binomial0.01 1/8 Binomial0.01 1/4 Binomial0.01 1/2 Binomial0.01 1/1

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

Proposed Pipeline for Genotype Calling from Short Reads

Mapped reads

Hapmap genotypes90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

Reference genome sequence

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

… …

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

Read sequences

Quality scores

SNP genotype callsrs12095710 T T 9.988139e-01rs12127179 C T 9.986735e-01rs11800791 G G 9.977713e-01rs11578310 G G 9.980062e-01rs1287622 G G 8.644588e-01 rs11804808 C C 9.977779e-01rs17471528 A G 5.236099e-01rs11804835 C C 9.977759e-01rs11804836 C C 9.977925e-01rs1287623 G G 9.646510e-01 rs13374307 G G 9.989084e-01rs12122008 G G 5.121655e-01rs17431341 A C 5.290652e-01rs881635 G G 9.978737e-01 rs9700130 A A 9.989940e-01 rs11121600 A A 6.160199e-01rs12121542 A A 5.555713e-01rs11121605 T T 8.387705e-01rs12563779 G G 9.982776e-01rs11121607 C G 5.639239e-01rs11121608 G T 5.452936e-01rs12029742 G G 9.973527e-01rs562118 C C 9.738776e-01 rs12133533 A C 9.956655e-01rs11121648 G G 9.077355e-01rs9662691 C C 9.988648e-01 rs11805141 C C 9.928786e-01rs1287635 C C 6.113270e-01

Page 126: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

ORIGINAL PROPOSAL PRESENTATION NEXT

Page 127: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Efficient Algorithms for SNP Genotype Data Analysis using

HiddenMarkov Models of Haplotype

Diversity

Justin KennedyDissertation Proposal for the Degree of Doctorate in Philosophy

Computer Science & Engineering DepartmentUniversity of Connecticut

129

Page 128: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion130

Page 129: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction-Single Nucleotide Polymorphisms

Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)

High density in the human genome: 1.2x107 out of 3109 base pairs

Vast majority bi-allelic 0/1 encoding (major/minor resp.)

SNP Genotypes are critical to Disease-Gene Mapping

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcGgtatacacgggTctata …

131

Page 130: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Why SNP Genotypes?

Single Nucleotide Polymorphisms (SNPs) have become the genetic marker of choice for genome wide association studies (GWASs)

GWAS: Method for mapping disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association. Provides Higher statistical power compared to other gene mapping methods such as linkage for uncovering genetic basis of complex diseases

Ongoing GWASs generate a deluge of genotype data Genetic Association Information Network (GAIN): 6 studies totaling 18,000

individuals typed at 500,000 to 940,000 SNP loci Wellcome Trust Case-Control Consortium (WTCCC): 7 studies totaling

17,000 individuals typed at 500,000 SNP

Major concern: quality of genotype data132

Page 131: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction-Computational Challenges to Disease Gene Mapping

Genotype error detection: Low levels of genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes

Handling structural variation data

provided by new sequencing technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to random nature of shotgun sequencing

133

Page 132: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction Genotype Error Detection using Hidden Markov Models of

Haplotype Diversity Motivation Likelihood Sensitivity Approach to Error Detection

Hidden Markov Model of Haplotype Diversity

Efficiently Computable Likelihood functions

Experimental Results Single Individual Genotyping from Low-Coverage

Sequencing Data Conclusion

134

Page 133: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Motivation

A real problem despite advances in genotyping technology [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP

genotypes typed multiple times 1% errors decrease power by 10-50% for linkage, and by 5-20% for

association [Douglas et al. 00, Abecasis et al. 01] Error types

Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004]

For pedigree data some errors detected as Mendelian Inconsistencies (MIs) E.g. Only ~30% detectable as MIs for trios [Gordon et al. 1999]

Undetected errors Methods for Handling Undetected errors:

Improved genotype calling algorithms [Marchini et al. 07,, Xiao et al. 07] Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al.

07] Separate error detection step

Detected errors can be retyped, imputed, or ignored in downstream analyses

135

Page 134: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Haplotypes and Genotypes

Haplotype: description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor

Diploids: two homologous copies of each autosomal chromosome

One inherited from mother and one from father Genotype: description of alleles on both chromosomes

0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles

011000110001100010021200210

+two haplotypes per individual

genotype136

Page 135: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

0 1 1 1 0 0 h10 0 0 1 0 1 h3

0 1 1 1 0 0 h1

0 1 0 1 0 1 h2

0 0 0 1 0 1 h3

0 1 1 1 0 0 h4

)()()()( MAX)( 4321 hphphphpTL

137

Page 136: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

)()()()( MAX)( 4321 hphphphpTL

? 0 1 0 1 0 1 h’ 1 0 0 0 1 0 0 h’ 3

0 1 0 1 0 1 h’1

0 1 1 1 0 0 h’2

0 0 0 1 0 0 h’ 3

0 1 1 1 0 1 h’ 4

Likelihood of best phasing for modified trio T’

)'()'()'()'( MAX)'( 4321 hphphphpTL 138

Page 137: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

?

Large change in likelihood suggests likely error Flag genotype as an error if L(T’)/L(T) > R, where R is the

detection threshold (e.g., R=104)

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

139

Page 138: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection

[Becker et al. 06] Implementation in FAMHAP Software

Window-based algorithm For each window including the SNP

under test, generate list of H most frequent haplotypes (default H=50)

Find most likely trio phasings by pruned search over the H4 quadruples of frequent haplotypes

Flag genotype as an error if L(T’)/L(T) > R for at least one window

Mother …201012 1 02210...Father …201202 2 10211...Child …000120 2 21021...

140

Page 139: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Limitations of FAMHAP

Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values

False positives caused by nearby errors (due to the use of multiple short windows)

Our approach: HMM of haplotype frequencies all haplotypes

represented + no need for short windows Alternate likelihood functions scalable runtime

141

Page 140: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

Similar HMMs proposed by [Kimmel &Shamir 05, Rastas et al. 05, Schwartz 04]

Paths with high transition probability correspond to “founder” haplotypes

Haplotype sequence/paths computed using Viterbi and forward algorithms

K= #Founders(E.g. K=4)

Transition Prob

Emission Prob

N= #SNPs(E.g. N=5)

:),'( qqj:);( qjE

142

Page 141: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

Training: 2- step algorithm that exploits pedigree info

Step 1: Obtain haplotypes from using either: ENT: A pedigree-aware haplotype phasing algorithm based on

entropy-minimization Haplotype reference panel (e.g. HAPMAP)

Step 2: train HMM based on inferred haplotypes, using Baum-Welch

:),'( qqj

K= #Founders(E.g. K=4)

Transition Prob

Emission Prob

N= #SNPs(E.g. N=5)

:);( qjE

143

Page 142: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Alternate Likelihood Functions

• Viterbi probability (ViterbiProb): Maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio

• Probability of Viterbi Haplotypes (ViterbiHaps): Product of total probabilities of the 4 Viterbi haplotypes

• Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths

),'()';1({max);();( 41'

qqqqqq jQ

jVjEjVj

4

1'

)},'()';1({);();(jQ

jjTjEjTq

qqqqq 144

Page 143: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Speed Up of Viterbi Probability

For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in time

K3 speed-up by reuse of common terms (similar to [Rastas et al. 05]):

)( 8NKO

)},'()',,,;1({max),,,;(),,,;1( 4443213'43214321 14qqqqqqjSqqqqjEqqqqjV jQq j

)},'()',',,;1({max)',,,;( 3343212'43213 13qqqqqqjSqqqqjS jQq j

)},'()',',',;1({max)',',,;( 2243211'43212 12qqqqqqjSqqqqjS jQq j

)},'()',',',';1({max)',',',;( 114321'43211 11qqqqqqjVqqqqjS jQq j

145

Page 144: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Overall Function Runtimes

Viterbi probability Likelihoods of all 3N modified trios can be computed

within time using forward-backward algorithm Overall runtime for M trios

Probability of Viterbi haplotypes Obtain haplotypes from standard traceback, then

compute haplotype probabilities using forward algorithms

Overall runtime Total trio probability

Similar pre-computation speed-up & forward-backward algorithm

Overall runtime

))(( 25 KNNKMO

)( 5MNKO

)( 5MNKO)( 5NKO

146

Page 145: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Experimental Results (Setup)

Real dataset [Becker et al. 2006] 35 SNP loci covering a region of 91kb 551 trios

Synthetic datasets 35 SNPs, 551 trios Preserved missing data pattern of real

dataset Haplotypes assigned to trios based on

frequencies inferred from real dataset 1% error rate using random allele insertion

model147

Page 146: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Comparison of Likelihood

Functions

-0.005 0.005 0.0150

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

VitHaps-PVitProb-PTotalProb-PVitHaps-CVitProb-CTotalProb-C

FP rate

Sens

itivi

ty

Sensitivity=TP/(TP+TN)

False Positive rate = 1 - TN/(FP+TN) 148

Page 147: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Distribution of Log-Likelihood Ratios for TotalTrioProb

Parents-TRIOS

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.

7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.

4

5.67

5.94

NO_ERR ERR

Children-TRIOS

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.

7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.

4

5.67

5.94

NO_ERR ERR

Same-locus errors in parents

149

Page 148: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-“Combined” Detection Method

Compute 4 likelihood ratios

Trio Mother-child duo Father-child duo Child (unrelated)

Flag as error if all ratios are above detection threshold

150

Page 149: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Distribution of Log-Likelihood Ratios for Combined Method

Parents-COMBINED

1

10

100

1000

10000

100000

1000000

0

0.27

0.54

0.81

1.08

1.35

1.62

1.89

2.16

2.43 2.7

2.97

3.24

3.51

3.78

4.05

4.32

4.59

4.86

5.13 5.4

5.67

5.94

NO_ERR ERR

Children-COMBINED

1

10

100

1000

10000

100000

1000000

0

0.28

0.56

0.84

1.12 1.4

1.68

1.96

2.24

2.52 2.8

3.08

3.36

3.64

3.92 4.2

4.48

4.76

5.04

5.32 5.6

5.88

NO_ERR ERR

151

Page 150: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Comparison with FAMHAP (Children)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

Sens

itivi

ty

FP rate

#FP=#FN line

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP

152

Page 151: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Comparison with FAMHAP (Parents)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

Sens

itivi

ty

FP rate

#FP=#FN line

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP

153

Page 152: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction Genotype Error Detection using Hidden Markov

Models of Haplotype Diversity Single Individual Genotyping from Low-Coverage

Sequencing Data Motivation Single SNP Calling Algorithm Multilocus HMM Calling Algorithm Experimental Results

Conclusion

154

Page 153: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Next Generation Sequencing (NGS)

Illumina / Solexa Genetic Analyzer 1G1000 Mb/run, 35bp reads

Roche / 454 Genome Sequencer FLX100 Mb/run, 400bp reads

Applied BiosystemsSOLiD3000 Mb/run, 25-35bp reads

By several orders of magnitude, NGS delivers higher throughput of sequencing reads compared to older technologies (e.g. Sanger sequencing)

More improvements expected in quest for $1,000 genome

155

Page 154: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-NGS Applications and Challenges

NGS is enabling many applications, including personal genomics ~$1 million for sequencing James Watson genome [Wheeler et al 08]

using 454 technology. ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07] Thousands more individual genomes to be sequenced as part of

1000 Genomes Project Challenges:

Sequencing requires accurate determination of genetic variation (e.g. SNPs)

Accuracy is limited by coverage depth due to random nature of shotgun sequencing

For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].

[Wheeler et al 08] use hypothesis testing based on binomial distribution [Wendl&Wilson 08] predict that 21x coverage will be required for

sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

156

Page 155: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Do Heuristic Inputs Help?

We propose methods incorporating two additional sources of information:

Quality scores reflecting uncertainty in sequencing data

Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap

Experiments on a subset of the James Watson 454 reads show that our methods yield improved genotyping accuracy

Improvement depends on the coverage depth (higher at lower coverage), e.g., accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6-fold mapped read coverage is achieved by our methods using less than 1/4 of the reads 157

Page 156: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Pipeline for Single Genotype Calling

158

Page 157: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Single SNP Genotyping- Basic Notations

Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)

SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous

Inferred genotypesMapped reads with allele 0

Mapped reads with allele 1012100120

Sequencing errors159

Page 158: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Single SNP Genotyping- Incorporating Base Call Uncertainty

Let ri denote the set of mapped reads covering SNP locus i and ci=| ri |

For a read r in ri , r(i) denotes the allele observed at locus i If qr(i) is the phred quality score of r(i), the probability that r(i)

is incorrect is given by The probability of observing read set ri conditional on

having genotype Gi is then given by:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

10/)(

)(10 irqir

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r(

160

Page 159: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Single SNP Genotype Calling

Applying Bayes’ formula:

Where are allele frequencies inferred from a representative panel

}2,1,0{)|r()(

)|r()()r|(g iiii

iiiiiii gGPgGP

gGPgPgGP

)( ii gGP

161

Page 160: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Pipeline for Multilocus Genotyping

162

Page 161: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HF-HMM

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

163

Page 162: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HF-HMM Training

Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father

Use haplotype reference panel (e.g. HAPMAP) for training Haplotypes

Conditional probabilities for read sets are given by the formulas derived for the single SNP case:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r( 164

Page 163: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Remark: maxgP(g | r) is hard to approximate within unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard

Multilocus Genotyping Problem

GIVEN: • Shotgun read sets r=(r1, r2, … , rn)• Quality scores• Trained HMMs representing LD in populations of origin for

mother/fatherFIND:

• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r)

)( 1 nO

165

Page 164: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-HMM-Posterior Decoding Algorithm

1. For each i = 1..n, compute

2. Return *)*,...,(* 1 nggg

)r,(maxarg)r|(maxarg* igigi gPgPgii

166

Page 165: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Forward-Backward Computation of Posterior Probabilities

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

fi …

hi

gi

…r1,1ri,1

f’i …

h’i

r1,c …ri,c …Rn,1 Rn,c

1i n

167

Page 166: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Forward-Backward Computation of Posterior Probabilities

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

fi …

hi

gi

…r1,1ri,1

f’i …

h’i

r1,c …ri,c …Rn,1 Rn,c

1i n

168

Page 167: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

fi …

hi

gi

…r1,1ri,1

f’i …

h’i

r1,c …ri,c …Rn,1 Rn,c

1i n

Forward-Backward Computation of Posterior Probabilities

169

Page 168: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

fi …

hi

gi

…r1,1ri,1

f’i …

h’i

r1,c …ri,c …Rn,1 Rn,c

1i n

Forward-Backward Computation of Posterior Probabilities

170

Page 169: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

fi …

hi

gi

…r1,1ri,1

f’i …

h’i

r1,c …ri,c …Rn,1 Rn,c

1i n

Forward-Backward Computation of Posterior Probabilities

171

Page 170: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Multilocus Genotyping-Runtime

Direct implementation gives O(m+nK4) time: m = number of reads n = number of SNPs K = number of founder haplotypes in HMMs

Runtime reduced to O(m+nK3) by reusing common terms:

where

K

f

iffii

iff

iff

iiiiiii

ffP1

1,

'1

'1,,

'1

'11

'11

'1

)|(

K

f

iffii

iff

iiiii

ffP1

,1,'1

'1

' )|(

}1,0{,

'',

'' )|()|()|()(

iiii

hhiiiiiiiii

iff

hhGrPfhPfhPg172

Page 171: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Low Coverage Genotyping-Experimental Results- Setup

Subset of James Watson’s 454 reads 74.4M of 106.5M reads

265 bp/read avg coverage: 5.64X

Quality scores included Reads mapped on human genome build 36.3

using the nucmer tool of the MUMmer package [Kurtz et al 04]

Estimated mapping error rates: FP rate: 0.37% FN rate: 21.16%

Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)

173

Page 172: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (Heterozygous Genotypes)

80

82

84

86

88

90

92

94

96

98

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

174

Page 173: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (All Genotypes)

93

94

95

96

97

98

99

100

0 20 40 60 80 100

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

175

Page 174: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy at Varying Coverages (All Genotypes)

70

75

80

85

90

95

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

Posterior 1/16 Posterior 1/8 Posterior 1/4 Posterior 1/2 Posterior 1/1Binomial 1/16 Binomial 1/8 Binomial 1/4 Binomial 1/2 Binomial 1/1

176

Page 175: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline

Introduction

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Single Individual Genotyping from Low-Coverage Sequencing Data

Conclusion177

Page 176: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion Genotype Error Detection

Contributions: Proposed efficient methods for error detection in trio genotype data

based on an HMM of haplotype diversity Can exploit available pedigree info Yield improved detection accuracy compared to FAMHAP Runtime grows linearly in #SNPs and #individuals

Papers/Software: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection

using Hidden Markov Models of Haplotype diversity. Journal of Computational Biology (to Appear).

J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007

J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of haplotype diversity. In 3rd RECOMB Satellie Workshop on: Computational Methods for SNPs and Haplotypes, 2007

Software: GEDI (Genotype Error Detection and Imputation): Best poster award: J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype

Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.

178

Page 177: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion Genotyping from low coverage sequencing reads

Contributions: Exploiting “heuristic inputs” such as quality scores and population

allele frequency and LD information yields significant improvements in genotyping calling accuracy from low-coverage sequencing data

LD information extracted from a reference panel gives highest benefit Relatively small gain from incorporating quality scores may be due in

part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]

Although our evaluation is on 454 reads, the methods are well-suited for short read technologies

Papers/Software: S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single

individual genotyping from low-coverage sequencing data (In Preparation)

Presentation: J.Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.

Software: Gene-Seq179

Page 178: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion Future Work: Population-based

Genotyping from Low-Coverage Sequencing Data Extending the single individual

genotyping methods to population sequencing data (removing the need for reference panels)

Use Same HF-HMM as before, only training model based off of EM algorithm on population-level data provided. 180

Page 179: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Questions?

181

Page 180: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Acknowledgments

This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation

182

Page 181: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Click again for helper slides

183

Page 182: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HAPMAP: The International HAPMAP Project is an organization

whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.

HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.

The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.

184

Page 183: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Baum-Welch overview The algorithm has two steps:

Calculating the forward probability and the backward probability for each HMM state;

Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.

185

Page 184: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Error Detection Accuracy on Unrelated Genotype Data

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

Sens

itivi

ty

FP rate

#FP=#FN line

Len=10Kb (U)

Len=100Kb (U)

Len=1Mb (U)

Len=10Mb (U)

551 unrelated individuals Recombination & mutation rates of 10-8 per generation

per bp 35 SNPs within a region of 10kb-10Mb

186

Page 185: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

TrioProb-Combined Results on Real Dataset

[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3

26 SNP genotypes in 23 trios were identified as true errors 41*3-26=97 resequenced SNP genotypes agree with

original calls (or are unknown)

  Total Signals True Positives False Positives Unknown

FP Rate 1%.5% .1% 1%.5% .1% 1%.5% .1% 1%.5% .1%

Parents 218 127 69 9 9 8 1 0 0 208 118 91

Children 104 74 24 11 11 11 3 3 2 90 60 11

Total 322 201 93 20 20 19 4 3 2 298 178 72

187

Page 186: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Error Model Comparison

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

random allele (P)random geno (P)hetero-to-homo (P)homo-to-hetero (P)random allele (C)random geno (C)hetero-to-homo (C)homo-to-hetero (C)

188

Page 187: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Effect of Population Size

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

n=551 (P)n=129 (P)n=30 (P)n=551 (C)n=129 (C)n=30 (C)

189

Page 188: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Binomial Distribution

n: # of successive trials (reads)p: the probability of a success (correct map)1-p: =q , theprobability of a failure (incorrect map)

To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

190

Page 189: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conditional Probability for Heterozygous Genotypes

i

i

c

riririi GP

21

21)1(

21)1|r(

r)()(

191

Page 190: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Model Training- Details

Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP

ic

ii GP

21)1|r(

)(1)(

)()(

)()(

)(1)(, 1

221

2)|( ir

irir

iriir

irir

iri

iijigggGrRP

192

Page 191: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Implementation Details

iii

iiii

ghhhh

iiiiii

fffhPfhPg

'

''

}1,0{,

'',

)|()|()(

Forward recurrences:

)()( '11

1, ' fPfP

ii ff

K

fi

iffii

K

fii

iff

iff

i

iii

iiiigffPffP

11

1,1

11

1,,

1

'11'

1

'11

' )()|()|(

Backward recurrences are similar 193

Page 192: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Subset of James Watson’s 454 reads 74.4 million reads with quality scores (of 106.5 million

reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/

Average read length ~265 bpRead length distribution

0

500000

1000000

1500000

2000000

0 100 200 300 400 500 600 700 800 900 1000 1100 1200 194

Page 193: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]

Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)

Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded

Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:

FP rate: 0.37% FN rate: 21.16%

195

Page 194: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

Average coverage by mapped reads of Hapmap SNPs was 5.64x

Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

SNP coverage by mapped reads

0

100000

200000

300000

400000

500000

600000

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

196

Page 195: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results-Read Data

CEU genotypes from latest Hapmap release (23a) were dowloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch

Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% in CEU same-strand allele frequency

197

Page 196: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Accuracy Comparison (Homozygous Genotypes)

97

97.5

98

98.5

99

99.5

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

198

Page 197: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Gene-Seq Algorithms Posterior Decoding (see presentation) Greedy:

Markov Approximation:

Composite 2-SNP Viterbi Posterior Decoding, version 2

199

Page 198: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

200

Page 199: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction Helper Linkage Analysis: Study aimed at establishing linkage between

genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.

Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.

Relative Risk is the risk of an event (or of developing a disease) relative to exposure.

Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.

For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.

Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits

201

Page 200: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Association analysis

Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies

Introduction-Disease Gene Mapping

Linkage analysis

Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)

Low power to detect genes with small relative risk in complex diseases [RischMerikangas’96]

Cases Controls

202

Page 201: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Motivation

Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)

Errors as low as .1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]

1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

203

Page 202: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao

et al. 07] Explicit modeling in analysis methods

[Cheng 07, Hao & Wang 04, Liu et al. 07] Computationally complex

Separate error detection step Detected errors can be retyped, imputed, or ignored in

downstream analyses Common approach in pedigree genotype data analysis

[Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]

Genotype Error Detection-Motivation

204

Page 203: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Complexity of Computing Maximum Phasing Probability

• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(f½-) unless ZPP=NP, where f is the number of founders• For trios, hard to approx. within O(f1/4 -)• Reductions from the clique problem

205

Page 204: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

NGS Applications

Besides reducing costs of de novo genome sequencing, NGS has found many more apps:

Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …

NGS is enabling personal genomics James Watson genome [Wheeler et al 08] sequenced

using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]

Thousands more individual genomes to be sequenced as part of 1000 Genomes Project 206

Page 205: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Challenges in Medical Applications of Sequencing

Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)

Requires accurate determination of both alleles at variable loci

Accuracy is limited by coverage depth due to random nature of shotgun sequencing

For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].

[Wheeler et al 08] use hypothesis testing based on binomial distribution

To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”

207

Page 206: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Prior Methods for Calling SNP Genotypes from Read Data

Prior methods are all based on allele coverage [Levy et al 07] require that each allele be covered by at

least 2 reads in order to be called [Wheeler et al 08] use hypothesis testing based on the

binomial distribution To call a heterozygous genotype must have each allele

covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01

[Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k

208

Page 207: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads S. Dinakar1, Y. Hernandez2, J. Kennedy1, I. Mandoiu1, and Y. Wu11CSE Department, University of Connecticut 2Department of Computer Science, Hunter College

Motivation• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)

– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing

– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]

• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples – Based on idealized theory that “neglects heuristic inputs”

• We propose genotype calling methods that exploit following heuristic inputs– Quality scores reflecting uncertainty in sequencing data– Allele frequency and linkage disequilibrium info extracted from reference

panels such as Hapmap• Do heuristic inputs help?

– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads

• Applying Bayes’ formula:

– Where are allele frequencies inferred from a representative panel

Single SNP Genotype Calling• Let ri denote the arbitrarily ordered set of mapped reads

covering SNP locus i and ci=| ri |– For a read r in ri , r(i) denotes the allele observed at locus i– If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect

is given by

• The probability of observing read set ri conditional on having genotype Gi:

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

10/)(

)(10 irqir

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP ic

ii GP

21)1|r(

}2,1,0{)|r()(

)|r()()r|(g iiii

iiiiiii gGPgGP

gGPgPgGP

)( ii gGP

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]

Multilocus Model

• Initial founder probabilities P(f1), P(f’1), transition probabilities P(f i+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

• P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:

Model Training

)(1)(

)()(

)()(

)(1)(, 1

221

2)|( ir

irir

iriir

irir

iri

iijigggGrRP

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP ic

ii GP

21)1|r(

• Joint probabilities can be computed using a forward-backward algorithm:

• Direct implementation gives O(m+nK4) time, where – m = number of reads– n = number of SNPs– K = number of founder haplotypes in HMMs

• Runtime reduced to O(m+nK3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

Posterior Decoding Algorithm1. For each i = 1..n, compute2. Return *)*,...,(* 1 nggg

)r,(maxarg)r|(maxarg* igigi gPgPgii

• Read data– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et

al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/

– Average read length ~265 bp

• Reference population data– CEU genotypes from latest Hapmap release (23a) from

http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/

– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMM models using the Baum-Welch algorithm

• Genotype data– Duplicate Affymetrix 500k SNP genotypes downloaded from

ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz

– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% in CEU same-strand allele frequency

Experimental Setup

• Reads mapped on human genome build 36.3 using the nucmertool of the MUMmer package [Kurtz et al 04]– Default nucmer parameters:

• MUM size 20, min cluster size 65, max gap between adjacent matches 90– Additional filtering

• At least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)

– Reads meeting above conditions at multiple genome positions were discarded

• These reads are coming from genomic repeats, and is difficult to accurately map them

• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:– Estimated mapping FP rate is 0.37%

Read Mapping Procedure

Hapmap SNP coverage by mapped reads

0

100000

200000

300000

400000

500000

600000

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

• Average coverage by mapped reads of Hapmap SNPs was 5.64x– Lower than [Wheeler et al 08] since we start with a subset of the reads

and use more stringent mapping constraints

Read Mapping Results

Homozygous Genotypes

97

97.5

98

98.5

99

99.5

100

0 10 20 30 40 50

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

Genotyping Accuracy at 5.64x Read Coverage

Heterozygous Genotypes

80

82

84

86

88

90

92

94

96

98

100

0 10 20 30 40 50

% uncalled

% c

orre

ctly

cal

led

1SNP-Posterior Binomial0.01 HMM-Posterior

Conclusions & Ongoing Work• Exploiting “heuristic inputs” such as quality scores, population allele

frequency, and LD information yields significant improvements in genotyping calling accuracy from low-coverage sequencing data– Improvement depends on the coverage depth (higher at lower coverage)

• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads

– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]

– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads

• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

ACKNOWLEDGEMENTSThis work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.

Genotyping Accuracy at Lower Read Coverage

70

75

80

85

90

95

100

0 10 20 30 40 50 60 70 80 90 100

% uncalled

% c

orre

ctly

cal

led

HMM-Posterior 1/16 HMM-Posterior 1/8 HMM-Posterior 1/4 HMM-Posterior 1/2 HMM-Posterior 1/1Binomial0.01 1/16 Binomial0.01 1/8 Binomial0.01 1/4 Binomial0.01 1/2 Binomial0.01 1/1

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

Proposed Pipeline for Genotype Calling from Short Reads

Mapped reads

Hapmap genotypes90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

90 20934216 F 0 02110001?0100210010011002122201210211?122122021200018 F 15 1621100012010021001001100?100201?10111110111?021200015 M 0 0211200100120012010011200101101010111110111102120007 M 0 02110001001000200122110001111011100111?1212102220008 F 0 0011202100120022012211200101101210211122111?012000012 F 9 10211000100100020012211000101101110011121212102200009 M 0 0011?001?012002201221120010?1012102111221111012000011 M 7 821100210010002001221100012110111001112121210222000

Reference genome sequence

>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assemblyGAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAGCTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTGGATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGTAATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCCCTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATATTTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGAAATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAGTCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGAATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC

… …

>gnl|ti|1779718824 name:EI1W3PE02ILQXTTCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTTGTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATTCTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGTTAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC>gnl|ti|1779718825 name:EI1W3PE02GTXK0TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTTTAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGCAGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCAGGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA

>gnl|ti|1779718824 name:EI1W3PE02ILQXT28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 727 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 928 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 2640 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 2726 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 3728 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 2728 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 2828 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15

Read sequences

Quality scores

SNP genotype callsrs12095710 T T 9.988139e-01rs12127179 C T 9.986735e-01rs11800791 G G 9.977713e-01rs11578310 G G 9.980062e-01rs1287622 G G 8.644588e-01 rs11804808 C C 9.977779e-01rs17471528 A G 5.236099e-01rs11804835 C C 9.977759e-01rs11804836 C C 9.977925e-01rs1287623 G G 9.646510e-01 rs13374307 G G 9.989084e-01rs12122008 G G 5.121655e-01rs17431341 A C 5.290652e-01rs881635 G G 9.978737e-01 rs9700130 A A 9.989940e-01 rs11121600 A A 6.160199e-01rs12121542 A A 5.555713e-01rs11121605 T T 8.387705e-01rs12563779 G G 9.982776e-01rs11121607 C G 5.639239e-01rs11121608 G T 5.452936e-01rs12029742 G G 9.973527e-01rs562118 C C 9.738776e-01 rs12133533 A C 9.956655e-01rs11121648 G G 9.077355e-01rs9662691 C C 9.988648e-01 rs11805141 C C 9.928786e-01rs1287635 C C 6.113270e-01

Page 208: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

GEDI-ADMX presentation from earlier this year (Ft lauderdale)

Page 209: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Imputation-based local ancestry inference in admixed

populationsJustin Kennedy

Computer Science and Engineering DepartmentUniversity of Connecticut

Joint work with I. Mandoiu and B. Pasaniuc

211

Page 210: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

212

Page 211: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Motivation: Admixture mapping

Patterson et al, AJHG 74:979-1000, 2004 213

Page 212: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Local ancestry inference problem

rs11095710 T T rs11117179 C T rs11800791 G G rs11578310 G Grs1187611 G Grs11804808 C C rs17471518 A G...

Given: Reference haplotypes for ancestral populations P1,…,PN Whole-genome SNP genotype data for extant individual

Find: Allele ancestries at each SNP locus

Reference haplotypes

SNP genotypes

rs11095710 P1 P1rs11117179 P1 P1rs11800791 P1 P1rs11578310 P1 P2rs1187611 P1 P2rs11804808 P1 P2rs17471518 P1 P2...

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

Inferred local ancestry

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

1110001?0100110010011001111101110111?1111110111000 11100011010011001001100?100101?10111110111?0111000111100100110011010011100101101010111110111101110001110001001000100111110001111011100111?111110111000011101100110011011111100101101110111111111?011000011100010010001001111100010110111001111111110110000011?001?011001101111110010?1011101111111111011000011100110010001001111100011110111001111111110111000

214

Page 213: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Previous work MANY methods

Ancestry inference at different granularities, assuming different kinds/amounts of info about genetic makeup of ancestral populations

Two main classes of methods HMM-based (exploit LD): SABER [Tang et al 06], SWITCH

[Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …

Window-based (unlinked SNP Data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]

Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)

Methods based on unlinked SNPs outperform methods that model LD!

215

Page 214: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

216

Page 215: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Haplotype structure in panmictic populations

217

Page 216: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HMM of haplotype frequencies

Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]

K = 4(# founders)

n = 5(# SNPs)

218

Page 217: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Graphical model representation

Random variables for each locus i (i=1..n) Fi = founder haplotype at locus i; values between 1 and K Hi = observed allele at locus i; values: 0 (major) or 1

(minor) Model training

Based on reference haplotypes using Baum-Welch alg, or Based on unphased genotypes using EM [Rastas et al. 05]

Given haplotype h, P(H=h|M) can be computed in O(nK2) using a forward algorithm, where n=#SNPs, K=#founders

F1 F2 Fn…

H1 H2 Hn

219

Page 218: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

F1 F2 Fn…

H1 H2 Hn

F'1 F'2 F'n…

H'1 H'2 H'n

G1 G2 Gn

Factorial HMM for genotype data in a window with known local ancestry

klM

Random variable for each locus i (i=1..n) Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor

hom.) 220

Page 219: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

221

Page 220: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HMM Based Genotype Imputation

Probability of observing genotype at locus i given the known multilocus genotype with missing data at i:

gi is imputed as )|][(argmax }2,1,0{ MxggP ix

)|][(),|( MxggPMgxgP iii

x

222

Page 221: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

fi …

hi

gi

f’i …

h’i

Forward-backward computation

)()|( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fgMgP

iii iiiii

223

Page 222: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

fi …

hi

gi

f’i …

h’i

Forward-backward computation

)()|( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fgMgP

iii iiiii

224

Page 223: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

fi …

hi

gi

f’i …

h’i

Forward-backward computation

)()|( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fgMgP

iii iiiii

225

Page 224: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

fi …

hi

gi

f’i …

h’i

Forward-backward computation

)()|( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fgMgP

iii iiiii

226

Page 225: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

)()( '11

1, ' fPfP

ii ff

K

fi

iffii

K

fii

iff

iff

iii

iiiii

gffPffP1

11

,'1

'

11

1,,

1

'11'

1

'11

' )()|()|(

Runtime Direct recurrences for computing forward probabilities

O(nK4) :

Runtime reduced to O(nK3) by reusing common terms:

where )()|( 1

1

1,

'1

'1,,

'1

'11

'11

'1

i

K

f

iffii

iff

iff

gffPi

iiiiii

K

f

iffii

iff

iiiii

ffP1

,1,'1

'1

' )|(

227

Page 226: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Imputation-based ancestry inference

View local ancestry inference as a model selection problem Each possible local ancestry defines a factorial

HMM compute for all possible k,l,i,x values Pick model that re-imputes SNPs most

accurately around the locus i. Fixed-window version: pick ancestry that

maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus

Multi-window version: weighted voting over window sizes between 200-3000, with window weights proportional to average posterior probabilities

klM),|( ,lkii MgxgP

11M 12M 22M

228

Page 227: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Local Ancestry at a locus is an unordered pair of (not necessarily distinct) ancestral populations.

Observations: The local ancestry of a SNP locus is typically shared with

neighboring loci. Small Window sizes may not provide enough

information Large Window sizes may violate local ancestry property

for neighboring loci When using the true values of in ,the accuracy

of SNP genotype imputation within such a neighborhood is typically higher than when using a mis-specified model.

klMlk,

Imputation-based ancestry inference

229

Page 228: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

230

Page 229: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HMM imputation accuracy Missing data rate and accuracy for imputed

genotypes at different thresholds (WTCCC 58BC/Hapmap CEU)

231

Page 230: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.2n=38,864

r=10-8

Window size effect

232

Page 231: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Number of founders effect

CEU-JPTN=2,000

g=7=0.2

n=38,864 r=10-8

233

Page 232: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.2n=38,864

r=10-8

Comparison with other methods% of correctly recovered SNP ancestries

234

Page 233: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

N=2,000g=7

=0.5n=38,864

r=10-8

Untyped SNP imputation error rate in admixed individuals

235

Page 234: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Outline Introduction

Factorial HMM of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Conclusion

236

Page 235: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Conclusion-Summary and ongoing work

Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations

Code at http://dna.engr.uconn.edu/software/ Ongoing work

Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)

Extension to pedigree data Exploiting inferred local ancestry for more accurate

untyped SNP imputation and phasing of admixed individuals

Extensions to sequencing data Inference of ancestral haplotypes from extant admixed

populations 237

Page 236: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Questions?

238

Page 237: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

1. L.E. Baum, T. Petrie, G. Soules, and N.Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164{171, 1970.

2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661{678, 2007.3. Z. Ghahramani and M.I. Jordan. Factorial hidden Markov models. Mach. Learn., 29(2-3):245{273, 1997.

4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden markov models of haplotype diversity. Journal of Computational Biology, 15(9):1155{1171, 2008.

5. J. Kennedy, B. Pasaniuc, and I.I. Mandoiu. GEDI: Genotype error detection and imputation using hidden markov models of haplotype diversity, manuscript in preparation. software available at at http://dna.engr.uconn.edu/software/gedi/ .

6. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243{1260, 2005.

7. Y. Li and G. R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 79:2290, 2006.

8. J. Marchini, C. Spencer, Y.Y. Teo, and P. Donnelly. A bayesian hierarchical mix ture model for genotype calling in a multi-cohort study. in preparation, 2007.

9. B. Pasaniuc, S. Sankararaman, G. Kimmel, and E. Halperin. Inference of locus-specic ancestry in closely related populations (under review).

10. E. J. Parra, A. Marcini, J. Akey, J. Martinson, M. A. Batzer, R. Cooper, T. For-rester, D. B. Allison, R. Deka, R. E. Ferrell, et al. Estimating african american admixture proportions by use of population-specic alleles. Am J Hum Genet, 63(6):1839{1851, December 1998.

11. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, pages 355{372. Wiley, 2008.

12. D. Reich and Patterson N. Will admixture mapping work to nd disease genes? Philos Trans R Soc Lond B Biol Sci, 360:1605{1607, 2005.

13. S. Sankararaman, G. Kimmel, E. Halperin, and M.I. Jordan. On the inference of ancestries in admixed populations. Genome Research, (18):668{675, 2008.

14. S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin. Estimating local ancestry in admixed populations. American Journal of Human Genetics, 8(2):290{303,2008.

15. P. Scheet and M. Stephens. A fast and exible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypicphase. American Journal of Human Genetics, 78:629{644, 2006.16. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90{97, 2004.

17. M. W. Smith, N. Patterson, J. A. Lautenberger, A. L. Truelove, G. J. McDonald, A.Waliszewska, B. D. Kessing, M. J. Malasky, C. Scafe, E. Le, et al. A high-density admixture map for disease gene discovery in african americans. Am J Hum Genet,74(5):1001{1013, May 2004.

18. A. Sundquist, E. Fratkin, C.B. Do, and S. Batzoglou. Eect of genetic divergence in identifying ancestral origin using HAPAA. Genome Research, 18(4):676{682,2008.

19. H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet, 79:1{12, 2006.

20. H. Tang, Peng J., and Pei Wang P.and Risch N.J. Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology, 28:289{

301, 2005. 21. C. Tian, D. A. Hinds, R. Shigeta, R. Kittles, D. G. Ballinger, and M. F. Seldin. A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet, 79:640{649, 2006. 22. http://www.hapmap.org/.

239

Page 238: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Acknowledgments Work supported in part by NSF awards IIS-0546457

and DBI-0543365.

240

Page 239: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

HELPER SLIDES-STARTS NEXT PAGE

241

Page 240: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Introduction- Single Nucleotide Polymorphisms

Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)

Human genome density: 1.2x107 out of 3109

base pairs Vast majority bi-allelic 0/1 encoding (major/minor

resp.)

SNP Genotypes are critical to Disease-Gene Mapping One Method: Admixture Mapping

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcGgtatacacgggTctata …

012100120

011000110

001100010

+two haplotypes per individual

genotype

242

Page 241: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide- Other Software1) HMM-based methods:

a) SABERb) SWITCHc) HAPAA

2) Window based Majority vote:a) LAMP (no recombination assumption)b) WINPOP (a more refined model of recombination events coupled with an

adaptive window size computation to achieve increased accuracy.

243

Page 242: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide- Probabilities

)[(~)|( g]gg i gPgGPklkl MiiM

:,...,,,...,()[ n1i1-i1i ggg,ggg]g gg

:),...,,,...,( n1i1-i1 gggg ig:}2,1,0{g:}2,1,0{iG

:klM

Random variable genotype at SNP i

Multilocus genotype with I set to

Multilocus genotype without i

Genotype variable taken at SNP i

HMM with ancestral pair k,l

Page 243: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide- Emission Details

ig

igli

ki

li

ki

li

ki

hhhh

li

li

ki

kiFF

i FhPFhP}1,0{,

)|()|()(E

245

Page 244: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Input For every Window half-size w

Output (Single Window method) For every i=1..n:

Where:

)|( iiM gGPkl g {0,1,2}g,, klMi

i

klWj

iiMi

Akli gGPW

a )|(||

1maxargˆ 1ig

}1|{ nlkklA

}},min{},...,1{max{ winwiWi

Window-based local ancestry inference

Aai ˆ

246

Page 245: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

???????????????????????????????????????????????????????????????????????????????????????

???????????????????????????????????????????????????????????????????????????????????????

i

klWj

iiMi

Akli gGPW

a )|(||

1maxargˆ 1ig

Window-based local ancestry inference

247

Page 246: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

??????????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

??????????????????????????????????????????????????????????????????????????????????????

}11..1{

11 )(111maxargˆ

jMAkl GPa

kl 1g 10w }11,..,1{1 W

}11..1{}11..1{}11..1{

221211 111

111

111

jM

jM

jM PPP

1

1

Window-based local ancestry inference

248

Page 247: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }12,..,1{2 W

1

1

}12..1{

2222 )|(121maxargˆ

jMAkl gGPa

kl 2g

1

1

}12..1{}12..1{}12..1{

221211 121

121

121

jM

jM

jM PPP

????????????????????????????????????????????????????????????????????????????????????

????????????????????????????????????????????????????????????????????????????????????

Window-based local ancestry inference

249

Page 248: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

??????????????????????????????????????????????????????????????????????????????????

??????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }13,..,1{3 W

1

1

}13..1{

3333 )|(131maxargˆ

jMAkl gGPa

kl 3g

1

1

}13..1{}13..1{}13..1{

221211 131

131

131

jM

jM

jM PPP

1

1

Window-based local ancestry inference

250

Page 249: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

????????????????????????????????????????????????????????????????????????????????

????????????????????????????????????????????????????????????????????????????????

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }14,..,1{2 W

1

1

}14..1{

4444 )|(141maxargˆ

jMAkl gGPa

kl 4g

1

1

}14..1{}14..1{}14..1{

221112 141

141

141

jM

jM

jM PPP

1

1

1

2

Window-based local ancestry inference

251

Page 250: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

1112222222222222221111111111111122222222222222222222222221111111111111222222

1111111112222222211111111111111112222 ???????????????????????????????????????????

1112222222222222221111111111111122222 ???????????????????????????????????????????

1111111112222222211111111111111112222222222222222111111111111111111111111111

6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….

…….…………………………..……

k

l

YORP :1 CEUP :2

10w }10,..,10{ iiWi

iii Wj

MWj

MWj

M PPP221112 21

1211

211

1

2

i

klWj

iiMAkli gGPa )|(211maxargˆ 1ig

Window-based local ancestry inference

252

Page 251: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Experimental Results- GEDI 1-pop Imputation

1,444 individuals trained on HAPMAP

CEU haplotype reference panel

Imputed (after masking) 1% of SNPs on chromosome 22

253

Page 252: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Helper Slide-Software Overview Only Require information about ancestral allele

frequencies: LAMP WINPOP SWITCH (HMM-Based)

Only require ancestral allele frequencies & genotypes

SABER (HMM-Based) Additionally use ancestral haplotype

information: HAPAA (HMM-Based) GEDI-ADMX (HMM-Based)

254

Page 253: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

Similar HMMs proposed by [Kimmel &Shamir 05, Rastas et al. 05, Schwartz 04]

Paths with high transition probability correspond to “founder” haplotypes

Haplotype sequence/paths computed using Viterbi and forward algorithms

K= #Founders(E.g. K=4)

Transition Prob

Emission Prob

n= #SNPs(E.g. n=5)

:),'( qqj:);( qjE

n= #SNPs

F1

H1

F2

H2

Fi

Hi

Fn

Hn(Graphical model representation)

255

Page 254: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

n= #SNPs

F1

H1

F2

H2

Fi

Hi

Fn

Hn(Graphical model representation)

Random variables for each locus i (i=1..n) Fi = founder haplotype at locus i; values between 1 and K Hi = observed allele at locus i; values: 0 (major) or 1

(minor) Given haplotype h, P(H=h|M) can be computed in

O(nK2) using a forward algorithm, where n=#SNPs, K=#founders 256

Page 255: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection-Hidden Markov Model of Haplotype Diversity

n= #SNPs

F1

H1

F2

H2

Fi

Hi

Fn

Hn(Graphical model representation)

Training: 2- step algorithm that exploits pedigree info

Step 1: Obtain haplotypes from using either: ENT: A pedigree-aware haplotype phasing algorithm based on

entropy-minimization Haplotype reference panel (e.g. HAPMAP)

Step 2: train HMM based on inferred haplotypes, using Baum-Welch 257

Page 256: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Factorial HMM for multilocus genotype data

n= #SNPs

F1

H1

F2

H2

Fi

Hi

Fn

Hn

F1

H1

F2

H2

Fi

Hi

Fn

Hn

G1 G2 GnGi

258

Page 257: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

Genotype Error Detection- Factorial HMM for multilocus trio data

n= #SNPsF1

H1

Fi

Hi

Fn

Hn

F1

H1

Fi

Hi

Fn

Hn

M1 Mi Mn

F1

H1

Fi

Hi

Fn

Hn

F1

H1

Fi

Hi

Fn

Hn

F1 Fi Fn

G1 Gi Gn

259

Page 258: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

(Graphical model representation)

Random variables for each locus i (i=1..n) Fi = founder haplotype at locus i; values between 1 and K Hi = observed allele at locus i; values: 0 (major) or 1

(minor) Model training

Based on reference haplotypes using Baum-Welch alg, or Based on unphased genotypes using EM [Rastas et al. 05]

Given haplotype h, P(H=h|M) can be computed in O(nK2) using a forward algorithm, where n=#SNPs, K=#founders

F1

H1

F2

H2

F3

H3

F4

H4

F5

H5

260

Page 259: Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

F1 F2 Fn…

H1 H2 Hn

F'1 F'2 F'n…

H'1 H'2 H'n

G1 G2 Gn

Factorial HMM for genotype data in a window with known local ancestry

klM

Random variable for each locus i (i=1..n) Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor

hom.) 261