advanced algorithms and models for computational biology -- a machine learning approach

Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Population Genetics: More on haplotypes --- coalescence, multi-population phasing, blocks, etc.

Eric Xing

Lecture 19, March 27, 2006

Clustering

How to label them? Inference.

How many clusters? Model selection, or inference?

Genetic Demography

Are there genetic prototypes among them ? What are they ? How many ? (how many ancestors do we have ?)

Inference done separately, or jointly?

Multi-population Genetic Demography

Clustering as Mixture Modeling

Model Selection vs. Posterior Inference

Model selection:

"intelligent" guess: ???

cross validation: data-hungry

information theoretic: AIC, TIC, MDL:

$\arg\min_K \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$

Posterior inference:

we want to handle uncertainty of model complexity explicitly:

$p(M \mid D) \propto p(D \mid M)\, p(M), \quad M \equiv \{\theta, K\}$

we favor a distribution that does not constrain M in a "closed" space!

Parsimony, Ockham's Razor

Haplotype Ambiguity

Heterozygous diploid individual: chromosomes Cp and Cm carry C T A and T G A

Sequencing gives the genotype g: TC TG AA, pairs of alleles whose associations to chromosomes are unknown

Haplotype h ≡ (h1, h2): possible associations of alleles to chromosomes, e.g. (C T A, T G A) or (C G A, T T A) ??????

This is a mixture modeling problem!

A Finite (Mixture of) Allele Model

The probability of a genotype g:

$p(g) = \sum_{h_1, h_2 \in H} p(h_1, h_2)\, p(g \mid h_1, h_2)$

Here H is the population haplotype pool, p(h1, h2) is the haplotype model, and p(g | h1, h2) is the genotyping model.

Standard settings:

|H| = K << 2^J: fixed-sized population haplotype pool

p(h1, h2) = p(h1) p(h2) = f1 f2: Hardy-Weinberg equilibrium

Problem: K ? H ?

[Graphical model: genotype G_n with haplotypes H_n1 and H_n2]
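The sum over haplotype pairs can be made concrete with a small sketch. It assumes a noiseless genotyping model (p(g | h1, h2) is 1 when the pair is consistent with g and 0 otherwise) and Hardy-Weinberg equilibrium; the pool, the frequencies, and the function names are illustrative assumptions, not taken from the lecture.

```python
from itertools import product


def genotype_of(h1, h2):
    """Unordered allele pair at each site implied by two haplotypes."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))


def genotype_probability(g, pool, freqs):
    """p(g) = sum over ordered pairs (h1, h2) of f1 * f2 * p(g | h1, h2).

    Assumption: p(g | h1, h2) is an indicator of consistency (no genotyping noise).
    """
    target = tuple(tuple(sorted(site)) for site in g)
    total = 0.0
    for (h1, f1), (h2, f2) in product(zip(pool, freqs), repeat=2):
        if genotype_of(h1, h2) == target:
            total += f1 * f2  # Hardy-Weinberg: p(h1, h2) = p(h1) p(h2)
    return total


# Hypothetical pool of K = 4 population haplotypes with made-up frequencies.
pool = ["CTA", "TGA", "CGA", "TTA"]
freqs = [0.4, 0.3, 0.2, 0.1]
g = ["TC", "TG", "AA"]
p_g = genotype_probability(g, pool, freqs)
```

With these numbers both (CTA, TGA) and (CGA, TTA) explain the genotype, which is exactly the ambiguity the mixture model has to resolve.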

The coalescent process

[Figure sequence: the genealogy of a present-day sample of 22 individuals traced backward in time. The number of ancestors falls through 18, 16, 14, 12, 9, 8, 7, 5, 3 and 2 to 1, the most recent common ancestor (MRCA).]

Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)

[Figure: genealogy from the most recent common ancestor (MRCA) to the present, with the sampled sequences at the tips; asterisks mark the polymorphic sites.]

TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
  *   ** * *

Assuming there are presently k active lineages:

The probability of coalescence: $\frac{k-1}{k-1+\alpha}$

The probability of mutation (killing a lineage): $\frac{\alpha}{k-1+\alpha}$
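A minimal sketch of this backward-in-time process: with k active lineages, the next event is a coalescence with probability (k-1)/(k-1+α) and a mutation with probability α/(k-1+α). Either event removes one active lineage, and each mutation founds one allele class. The symbol α for the scaled mutation rate and the function names are assumptions made for illustration.

```python
import random


def event_probabilities(k, alpha):
    """Return (P(coalescence), P(mutation)) when k lineages are active."""
    denom = k - 1 + alpha
    return (k - 1) / denom, alpha / denom


def count_alleles(n, alpha, rng):
    """Trace n sampled lineages back in time and count the distinct alleles.

    A mutation kills a lineage and defines a new allele; a coalescence merges
    two lineages. With one lineage left, the next event is always a mutation.
    """
    k, alleles = n, 0
    while k > 0:
        p_coal, _ = event_probabilities(k, alpha)
        if rng.random() >= p_coal:
            alleles += 1  # mutation event
        k -= 1  # either event removes one active lineage
    return alleles


rng = random.Random(0)
sample = [count_alleles(22, 1.0, rng) for _ in range(5)]
```

Under these assumptions the expected number of alleles in a sample of n is the sum of α/(k-1+α) for k = 1..n, so it grows only logarithmically with the sample size.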

Population Genetic Basis of an Infinite Allele model

[Figure panels: Natural genealogy (N individuals) ≈ Coalescent with mutation; Infinite mixtures]

Kingman coalescent process with binary lineage merging

New population haplotype alleles emerge along all branches of the coalescence tree at rate α/2 per unit length

The Ewens Sampling Formula: an exchangeable random partition of individuals

Dirichlet process mixture

A Hierarchical Bayesian Infinite Allele model

• Assume an individual haplotype h is stochastically derived from a population haplotype a_k with nucleotide-substitution frequency θ_k:

$h \sim p(h \mid \{a, \theta\}_k)$

• Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes.

• The number and identity of the population haplotypes are unknown: use a Dirichlet Process to construct a prior distribution G on H × R^J.

[Graphical model with nodes G0, G, {A_k, θ_k}, H_n1, H_n2 and G_n]

Dirichlet Process Priors

A c.d.f. G on a measurable space follows a Dirichlet Process if, for any finite measurable partition (B1, B2, ..., Bm) of that space, the joint distribution of the random variables is

(G(B1), G(B2), ..., G(Bm)) ~ Dirichlet(αG0(B1), ..., αG0(Bm)),

where G0 is the base distribution and α is the precision parameter.

Stick-breaking Process

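$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}$

Location: $\phi_k \sim G_0$

Mass: $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$, with $\beta_k \sim \mathrm{Beta}(1, \alpha)$ and $\sum_{k=1}^{\infty} \pi_k = 1$

Example (stick remaining, β_k, π_k):

1, 0.4, 0.4

0.6, 0.5, 0.3

0.3, 0.8, 0.24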
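The stick-breaking construction can be sketched in a few lines: each weight is a Beta(1, α) fraction of whatever stick is left, and each location is an independent draw from G0. This is a truncated illustration; the function names, the truncation level, and the uniform base sampler are assumptions.

```python
import random


def weights_from_betas(betas):
    """Turn break fractions beta_k into weights pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    weights, remaining = [], 1.0
    for beta in betas:
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights, remaining


def stick_breaking(alpha, num_atoms, base_sampler, rng):
    """Draw the first num_atoms (weight, location) pairs of G ~ DP(alpha, G0)."""
    betas = [rng.betavariate(1.0, alpha) for _ in range(num_atoms)]
    weights, remaining = weights_from_betas(betas)
    locations = [base_sampler(rng) for _ in range(num_atoms)]
    return weights, locations, remaining


# The worked numbers: fractions 0.4, 0.5, 0.8 give weights 0.4, 0.3, 0.24.
example_weights, leftover = weights_from_betas([0.4, 0.5, 0.8])

# A random draw with a hypothetical uniform base measure on [0, 1).
rng = random.Random(1)
weights, locations, tail = stick_breaking(2.0, 20, lambda r: r.random(), rng)
```

The returned `tail` is the mass not yet assigned to any atom; it shrinks geometrically on average, which is why a modest truncation is usually adequate for illustration.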

Chinese Restaurant Process

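CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers

$P(c_i = k \mid \mathbf{c}_{-i})$ for tables k = 1, 2, 3, ..., where m_k is the number of customers already seated at table k:

customer 1: 1, 0, 0

customer 2: 1/(1+α), α/(1+α), 0

customer 3: 1/(2+α), 1/(2+α), α/(2+α)

customer 4: 1/(3+α), 2/(3+α), α/(3+α)

customer i: m_1/(i-1+α), m_2/(i-1+α), ..., α/(i-1+α)

[Figure: customers 1 to 9 seated around tables, each table labelled with its own {A, θ}]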
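A short sketch of the seating rule: customer i joins an occupied table k with probability m_k/(i-1+α) and opens a new table with probability α/(i-1+α). The function names and the value of α are illustrative assumptions.

```python
import random


def crp_probabilities(table_counts, alpha):
    """Seating probabilities for the next customer: one entry per occupied
    table, followed by the probability of opening a new table."""
    denom = sum(table_counts) + alpha  # equals i - 1 + alpha for customer i
    return [m / denom for m in table_counts] + [alpha / denom]


def sample_crp(n, alpha, rng):
    """Seat n customers one at a time; return table assignments and table sizes."""
    counts, assignments = [], []
    for _ in range(n):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


rng = random.Random(2)
assignments, counts = sample_crp(9, 1.0, rng)
```

Because the partition is exchangeable, the order in which customers arrive does not change the probability of the final grouping, which is what makes Gibbs sampling over assignments straightforward.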

The DP Mixture of Ancestral Haplotypes

The customers around a table form a cluster:

associate a mixture component (i.e., a population haplotype) with a table

sample {a, θ} at each table from a base measure G0 to obtain the population haplotype and nucleotide-substitution frequency for that component

With p(h | {A, θ}) and p(g | h1, h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies)

DP-haplotyper

Inference: Markov Chain Monte Carlo (MCMC)

Gibbs sampling

Metropolis-Hastings

[Graphical model: a DP prior with base measure G0 generates G; infinite mixture components A_k (plate K, for population haplotypes); likelihood model for individual haplotypes H_n1, H_n2 and genotypes G_n (plate N)]

Model components

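Choice of base measure:

$G_0 \sim \prod_j \mathrm{Unif}(a_j)\, \mathrm{Beta}(\theta_j)$

Nucleotide-substitution model:

$p(h_i \mid \{a, \theta\}_k) = \prod_j p(h_{i,j} \mid a_{k,j}, \theta_{k,j})$, where

$p(h_{i,j} \mid a_{k,j}, \theta_{k,j}) = \begin{cases} \theta_{k,j} & \text{if } h_{i,j} = a_{k,j} \\ 1 - \theta_{k,j} & \text{if } h_{i,j} \neq a_{k,j} \end{cases}$

Noisy genotyping model:

$p(g_i \mid h_{i_1}, h_{i_2}) = \prod_j p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j})$, where

$p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j}) = \begin{cases} \gamma & \text{if } h_{i_1,j} \oplus h_{i_2,j} = g_{i,j} \\ (1 - \gamma)/2 & \text{if } h_{i_1,j} \oplus h_{i_2,j} \neq g_{i,j} \end{cases}$

Here $h_{i_1,j} \oplus h_{i_2,j}$ denotes the unordered pair of alleles carried by the two haplotypes at site j.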
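The two likelihood terms can be sketched for binary SNPs as follows. The coding is an assumption made for illustration: alleles are 0/1, a genotype site is the number of 1-alleles (0, 1 or 2), θ is read as the per-site probability that a haplotype copies the ancestral allele, and γ is the probability that the observed genotype matches the haplotype pair, with the remaining mass split evenly over the two other genotype values.

```python
def haplotype_likelihood(h, a, theta):
    """p(h | a, theta): product over sites of theta_j if h_j == a_j, else 1 - theta_j."""
    p = 1.0
    for h_j, a_j, theta_j in zip(h, a, theta):
        p *= theta_j if h_j == a_j else 1.0 - theta_j
    return p


def genotype_likelihood(g, h1, h2, gamma):
    """p(g | h1, h2): gamma per site where g_j equals the allele count of the
    pair, (1 - gamma) / 2 per site where it does not."""
    p = 1.0
    for g_j, x, y in zip(g, h1, h2):
        p *= gamma if g_j == x + y else (1.0 - gamma) / 2.0
    return p


# Hypothetical three-site example.
ancestor = [0, 1, 0]
theta = [0.9, 0.9, 0.9]
p_h = haplotype_likelihood([0, 1, 1], ancestor, theta)    # one site differs
p_g = genotype_likelihood([1, 2], [0, 1], [1, 1], 0.98)   # both sites consistent
```

Multiplying these two terms over an individual's haplotype pair gives the quantity that a sampler would weigh when it proposes a new phasing.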

Gibbs sampling

Starting from some initial haplotype reconstruction $H^{(0)}$, pick a first table with an arbitrary $a_1^{(0)}$, and form the initial population-haplotype pool $A^{(0)} = \{a_1^{(0)}\}$:

i) Choose an individual i and one of his/her two haplotypes t, uniformly at random, from all ambiguous individuals;

ii) Sample $c_{i_t}^{(t+1)}$ from $p(c_{i_t} \mid \mathbf{c}_{-i_t}^{(t)}, H^{(t)}, A^{(t)})$, update $\mathbf{c}^{(t+1)}$;

iii) Sample $a_k^{(t+1)}$, where $k = c_{i_t}^{(t+1)}$, from $p(a_k \mid \{h_{i'_{t'}}^{(t)} \ \text{s.t.}\ c_{i'_{t'}}^{(t+1)} = k\})$; update $A^{(t+1)}$;

iv) Sample $h_{i_t}^{(t+1)}$ from $p(h_{i_t} \mid \mathbf{c}^{(t+1)}, A^{(t+1)}, H_{-i_t}^{(t)})$, update $H^{(t+1)}$.
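The assignment step of such a sampler can be sketched as below. The weight of an occupied table is its current size times the likelihood of the haplotype under that table's ancestor; the weight of a new table is α times the likelihood averaged over the base measure. The simplifications are assumptions: binary alleles, a single shared θ instead of a Beta-distributed one, and a uniform prior over new ancestors, under which the averaged likelihood is 0.5 per site.

```python
import random


def site_likelihood(h, a, theta):
    """p(h | a, theta) with one shared theta: theta per matching site, 1 - theta otherwise."""
    p = 1.0
    for h_j, a_j in zip(h, a):
        p *= theta if h_j == a_j else 1.0 - theta
    return p


def assignment_weights(h, ancestors, counts, alpha, theta):
    """Unnormalised conditional for the ancestor index of haplotype h.

    counts must exclude h itself. The last entry is the weight of a new table:
    alpha * sum_a p(a) p(h | a, theta) = alpha * 0.5 ** len(h) for uniform binary a.
    """
    weights = [m * site_likelihood(h, a, theta) for a, m in zip(ancestors, counts)]
    weights.append(alpha * 0.5 ** len(h))
    return weights


def sample_assignment(h, ancestors, counts, alpha, theta, rng):
    """Draw an ancestor index; len(ancestors) means 'open a new table'."""
    weights = assignment_weights(h, ancestors, counts, alpha, theta)
    return rng.choices(range(len(weights)), weights=weights)[0]


ancestors = [(0, 1), (1, 0)]
counts = [3, 1]
w = assignment_weights((0, 1), ancestors, counts, alpha=1.0, theta=0.9)
k = sample_assignment((0, 1), ancestors, counts, 1.0, 0.9, random.Random(3))
```

A full sampler would follow this with the ancestor update and the haplotype update; those are omitted here to keep the sketch short.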

Convergence of Ancestral Inference

Haplotyping Error

The Gabriel data

Multi-population Genetic Demography

Pool everything together and solve 1 hap problem? --- ignore population structures

Solve 4 hap problems separately? --- data fragmentation

Co-clustering … solve 4 coupled hap problems jointly

Why humans are so similar and polymorphic patterns are regional

Population bottleneck: a small population that interbred reduced the genetic variation

Out of Africa ~ 100,000 years ago

Out of Africa

Each population can be associated with a unique DP capturing population-specific genetic demography

Different populations may have unique haplotypes

Different populations may share common haplotypes

Thus population-specific DPs are marginally dependent

Population Specific DPs

[Figure: population-specific DPs G1, G2, G3 and G4]

Hierarchical DP Mixture

[Figure: nodes H, ..., G1, G2, G3 and G4 for populations 1 to 4]
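One common way to see why population-specific DPs that draw from a shared discrete base become dependent is a truncated sketch: break a stick once to get shared weights over a finite set of candidate haplotypes, then give each population its own Dirichlet draw centred on those weights. The truncation level, the parameter names `top_concentration` and `alpha`, and the use of four populations are assumptions for illustration, not the lecture's exact model.

```python
import random


def truncated_stick_weights(top_concentration, num_atoms, rng):
    """Stick-breaking weights cut off at num_atoms and renormalised to sum to 1."""
    weights, remaining = [], 1.0
    for _ in range(num_atoms):
        b = rng.betavariate(1.0, top_concentration)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    total = sum(weights)
    return [w / total for w in weights]


def dirichlet(params, rng):
    """Dirichlet draw via normalised Gamma variates."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def population_weights(shared, alpha, num_pops, rng):
    """Each population reweights the same atoms: pi_j ~ Dirichlet(alpha * shared)."""
    return [dirichlet([alpha * w for w in shared], rng) for _ in range(num_pops)]


rng = random.Random(4)
shared = truncated_stick_weights(2.0, 8, rng)
per_population = population_weights(shared, 5.0, 4, rng)
```

Every population places mass on the same atoms, so haplotypes can be shared, while the population-specific weights let some haplotypes be common in one population and nearly absent in another.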

Simulated Data

As 1 HapTyping problem

As 5 HapTyping problems

HapMap Data

The block structure of haplotypes

The Daly et al. (2001) data set

This consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European-derived population, giving 258 transmitted and 258 untransmitted chromosomes.

The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.

Another study: the Patil et al data

The haplotype patterns for 20 independent globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major, yellow = minor allele. Each column represents a single chromosome.

The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs.

Figure 2 of Patil et al

Haplotype Blocks

Markov Dependence Between Blocks

On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.
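A minimal sketch of this first-order dependence: the haplotype index at block k is drawn from a row of a transition matrix selected by the index used at block k − 1. The block pools, the probabilities, and the function name are made-up illustrations.

```python
import random


def sample_chromosome(block_pools, initial, transitions, rng):
    """Sample one haplotype per block along a chromosome.

    block_pools[k] is the pool A(k); initial holds the probabilities for block 0;
    transitions[k - 1][i][j] = P(block k uses haplotype j | block k - 1 used i).
    """
    idx = rng.choices(range(len(block_pools[0])), weights=initial)[0]
    chosen = [block_pools[0][idx]]
    for k in range(1, len(block_pools)):
        row = transitions[k - 1][idx]
        idx = rng.choices(range(len(block_pools[k])), weights=row)[0]
        chosen.append(block_pools[k][idx])
    return chosen


# Hypothetical example: three blocks with two common haplotypes each.
block_pools = [["CTA", "TGA"], ["GG", "AT"], ["ACCT", "GCCA"]]
initial = [0.7, 0.3]
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],  # block 0 -> block 1
    [[0.6, 0.4], [0.1, 0.9]],  # block 1 -> block 2
]
chromosome = sample_chromosome(block_pools, initial, transitions, random.Random(5))
```

Strongly diagonal transition rows correspond to little historical recombination between adjacent blocks; rows close to uniform correspond to a recombination hotspot.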

Block Partitioning Algorithms

Greedy algorithm (Patil et al. 2001)

Begin by considering all possible blocks of ≥1 consecutive SNPs.

Next, exclude all blocks in which < 80% of the chromosomes in the data are defined by haplotypes represented more than once in the block (80% coverage).

Considering the remaining overlapping blocks simultaneously, select the one which maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block.

Any of the remaining blocks that physically overlap with the selected block are discarded, and the process is repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 with no gaps and with every SNP assigned to a block.

Hidden Markov models: maximum a posteriori inference (i.e., Viterbi) (Daly et al. 2001)

Minimum description length (Anderson et al. 2003)

Dynamic programming (Sun et al. 2002)
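The greedy procedure can be sketched on toy data as follows. This is an illustration under simplifying assumptions, not the published implementation: every chromosome is fully typed (no missing alleles), the number of discriminating SNPs is found by brute force, a block with a single common haplotype is treated as needing one SNP, and ties are broken towards longer blocks. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def min_tag_snps(haplotypes):
    """Smallest number of sites that tells all given haplotypes apart (brute force)."""
    if len(haplotypes) <= 1:
        return 0
    width = len(haplotypes[0])
    for size in range(1, width + 1):
        for sites in combinations(range(width), size):
            keys = {tuple(h[j] for j in sites) for h in haplotypes}
            if len(keys) == len(haplotypes):
                return size
    return width


def score_block(chromosomes, start, end, coverage=0.8):
    """Ratio of SNPs in the block to tag SNPs needed, or None if coverage fails."""
    counts = Counter(c[start:end] for c in chromosomes)
    common = [h for h, m in counts.items() if m > 1]
    covered = sum(counts[h] for h in common)
    if covered < coverage * len(chromosomes):
        return None
    return (end - start) / max(1, min_tag_snps(common))


def greedy_blocks(chromosomes, coverage=0.8):
    """Repeatedly pick the best-scoring block and drop every block overlapping it."""
    n = len(chromosomes[0])
    candidates = {}
    for s in range(n):
        for e in range(s + 1, n + 1):
            ratio = score_block(chromosomes, s, e, coverage)
            if ratio is not None:
                candidates[(s, e)] = ratio
    chosen = []
    while candidates:
        s, e = max(candidates, key=lambda b: (candidates[b], b[1] - b[0]))
        chosen.append((s, e))
        candidates = {b: r for b, r in candidates.items() if b[1] <= s or b[0] >= e}
    return sorted(chosen)


# Toy data: six chromosomes typed at four SNPs (alleles coded 0/1).
toy = ["0000", "0000", "0000", "1111", "1111", "0011"]
blocks = greedy_blocks(toy)
```

Blocks are half-open index ranges (start, end). On realistic data the quadratic number of candidate blocks and the brute-force tag search would need to be bounded, for example by capping the block length.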

Haplotypes are Shaped by Recombination

Inheritance of unknown generation

Hidden Markov DP for Recombination

[Figure: hidden Markov DP with a shared H and observations y1, y2, y3, …, yN]

References

E.P. Xing, R. Sharan and M.I. Jordan. Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML 2004).

N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294 (2001): 1719-1723.

M.J. Daly et al. High-resolution haplotype structure in the human genome. Nat. Genet. 29 (2001): 229-232.

E.C. Anderson and J. Novembre. Finding haplotype block boundaries using the minimum description length principle. American Journal of Human Genetics 73(2) (2003): 336-354.
