advanced algorithms and models for computational biology -- a machine learning approach

42
Advanced Algorithms Advanced Algorithms and Models for and Models for Computational Biology Computational Biology -- a machine learning approach -- a machine learning approach Population Genetics: Population Genetics: More on halpltypes More on halpltypes --- coalescence, multi-population --- coalescence, multi-population phasing, blocks, etc. phasing, blocks, etc. Eric Xing Eric Xing Lecture 19, March 27, 2006

Upload: meli

Post on 08-Jan-2016

29 views

Category:

Documents


3 download

DESCRIPTION

Advanced Algorithms and Models for Computational Biology -- a machine learning approach. Population Genetics: More on halpltypes --- coalescence, multi-population phasing, blocks, etc. Eric Xing Lecture 19, March 27, 2006. Clustering. How to label them ? inference How many clusters ??? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Advanced Algorithms Advanced Algorithms and Models for and Models for

Computational BiologyComputational Biology-- a machine learning approach-- a machine learning approach

Population Genetics:Population Genetics:

More on halpltypesMore on halpltypes--- coalescence, multi-population phasing, blocks, --- coalescence, multi-population phasing, blocks,

etc.etc.

Eric XingEric Xing

Lecture 19, March 27, 2006

Page 2: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Clustering

How to label them ? inference

How many clusters ??? model selection ? or inference ?

Page 3: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Genetic Demography

Are there genetic prototypes among them ? What are they ? How many ? (how many ancestors do we have ?)

Page 4: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Inference done separately, or jointly?

Multi-population Genetic Demography

Page 5: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Clustering as Mixture Modeling

Page 6: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Model selection "intelligent" guess: ??? cross validation: data-hungry information theoretic:

AIC TIC MDL :

Posterior inference:

we want to handle uncertainty of model complexity explicitly

we favor a distribution that does not constrain M in a "closed" space!

Model Selection vs. Posterior Inference

),ˆ|(|)(minarg KKL MLgf

)()|()|( MpMDpDMp

K,M

Parsimony, Ockam's RazorParsimony, Ockam's Razor

Page 7: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

C T A

T G A

C G A

T T A

??????

haplotype h(h1, h2)possible associations of alleles to chromosomes

Heterozygousdiploid individual

C T A

T G ACp

Cm

Genotype gpairs of alleles, whose

associations to chromosomes are unknown

ATGCsequencing

TC TG AA

This is a mixture modeling problem!

Haplotype Ambiguity

Page 8: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The probability of a genotype g:

Standard settings: H| = K << 2J fixed-sized population haplotype pool

p(h1,h2)= p(h1)p(h2)=f1f2 Hardy-Weinberg equilibrium

Problem: K ? H ?

,

2121

21

),|(),()(Hhh

hhgphhpgp

Genotypingmodel

Haplotypemodel

Population haplotypepool

A Finite (Mixture of ) Allele Model

Gn

Hn1 Hn2

Page 9: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

22 individuals

The coalescent process

Page 10: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

22 individuals

18 ancestors

The coalescent process

Page 11: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

22 individuals

18 ancestors

16 ancestors

The coalescent process

Page 12: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

22 individuals

18 ancestors

16 ancestors

14 ancestors

The coalescent process

Page 13: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

22 individuals

18 ancestors

16 ancestors

14 ancestors

12 ancestors

9 ancestors

8 ancestors

8 ancestors

7 ancestors

7 ancestors

5 ancestors

5 ancestors

3 ancestors

3 ancestors

3 ancestors

2 ancestors

2 ancestors

1 ancestor

Page 14: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

Most recent common ancestor(MRCA)

Page 15: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)

Page 16: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

Most recent common ancestor(MRCA)

TCGAGGTATTAACTCTAGGTATTAACTCGAGGCATTAACTCTAGGTGTTAACTCGAGGTATTAGCTCTAGGTATCAAC * ** * *

Page 17: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Present

Time

TCGAGGTATTAACTCTAGGTATTAACTCGAGGCATTAACTCTAGGTGTTAACTCGAGGTATTAGCTCTAGGTATCAAC * ** * *

Assuming there are presently k active lineages: The probability of coalescence:

The probability of mutation (killing a lineage):

11

kk

1k

Page 18: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Population Genetic Basis of an Infinite Allele model

Natural genealogy

N

Infinite mixturesCoalescent with mutation

Kingman coalescent process with binary lineage merging Kingman coalescent process with binary lineage merging New population haplotype alleles emerge along all branches of the New population haplotype alleles emerge along all branches of the

coalescence tree at rate coalescence tree at rate /2 per unit length/2 per unit length

The Ewens Sampling Formula: an exchangeable random partition of The Ewens Sampling Formula: an exchangeable random partition of individualsindividuals

Dirichlet process mixtureDirichlet process mixture

Page 19: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

A Hierarchical Bayesian Infinite Allele model

).},{|(~ kahph

• Assume anAssume an individual haplotypeindividual haplotype hh is stochastically is stochastically derived from aderived from a population haplotypepopulation haplotype ak withwith

nucleotide-substitution frequencynucleotide-substitution frequency k: :

• Not knowing the correspondences between individual Not knowing the correspondences between individual and population haplotypes, each individual haplotype and population haplotypes, each individual haplotype is a mixture of population haplotypesis a mixture of population haplotypes..

• The number and identity of the population haplotypes are unknownThe number and identity of the population haplotypes are unknown

use ause a Dirichlet Process Dirichlet Process to construct a priorto construct a prior distributiondistribution GG on on HH××RRJJ..

Gn

Hn1 Hn2

Ak k

G

G0

Page 20: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Dirichlet Process Priors

A c.d.f., G onfollows a Dirichlet Process if for any measurable finite partition of (B1,B2, .., Bm), of , the joint distribution of the random variables

(G(B1), G(B2), …, G(Bm)) ~ Dirichlet(G0(B1), …., G0(Bm)),

where, G0 is a the base distribution and a is the precision parameter

Page 21: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Stick-breaking Process

G0

0 0.4 0.4

0.6 0.5 0.3

0.3 0.8 0.24

),Beta(~

)-(

~

)(

-

1

1

1

1

1

1

0

1

k

k

jkkk

kk

k

kkk

G

G

Location

Mass

Page 22: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Chinese Restaurant Process

CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers

=)|=( -ii kcP c 1 0 0

+1

10

+1

+2

1

+2

1

+2

+3

1

+3

2

+3

1-+1

i

m

1-+2

i

m

1-+

i....

1 2

Page 23: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

{A} {A} {A} {A} {A} {A} ……3

12 4

56 7

8 9

The DP Mixture of Ancestral Haplotypes

The customers around a table form a cluster associate a mixture component (i.e., a population haplotype) with a table

sample {a, } at each table from a base measure G0 to obtain the population haplotype and nucleotide substitution frequency for that component

With p(h|{}) and p(g|h1,h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)

Page 24: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

DP-haplotyper

Inference: Markov Chain Monte Carlo (MCMC) Gibbs sampling Metropolis Hasting

Gn

Hn1 Hn2

A

N

K

G

G0 DP

infinite mixture components(for population haplotypes)

Likelihood model(for individual

haplotypes and genotypes)

Page 25: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Model components

Choice of base measure:

Nucleotide-substitution model:

Noisy genotyping model:

j

jaG )Beta()Unif(~ 0

jkjijk

jkjijkjkjkji

jjkjkjiki

ah

ahahp

ahpahp

,,,

,,,

,,,

,,,

if

if ),|( where

),|()},{|(

1

jijiji

jijiji

jijiji

jjijijiiii

ghh

ghhhhgp

hhgphhgp

,,,

,,,

,,,

,,,

if 2

if ),|( where

),|(),|(

21

21

21

2121

1

Page 26: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Gibbs sampling

Starting from some initial haplotype reconstruction H(0) , pick a first table with an arbitrary a1

(0) , and form initial population-hap pool A(0) ={a1

(0) }:

i) Choose an individual i and one of his/her two haplytopes t, uniformly and at random, from all ambiguous individuals;

ii) Sample from , update ;

iii) Sample , where , from ;

update A(t+1) ;

iii) Sample from , update H(t+1).

),,|( )()()()1( ttti

ti Hccp

ttA

)1( tit

c

)1( tka )1( t

itck ) s.t. |( )1(

')('

)1(

''kchap t

iti

tk tt

)1( tit

h ),,|( )1()()1()1(

tti

ti

ti ttt

Hchp A

)1( tc

Page 27: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Convergence of Ancestral Inference

Page 28: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Haplotyping Error

The Gabriel data

Page 29: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Multi-population Genetic Gemography

Pool everything together and solve 1 hap problem? --- ignore population structures

Solve 4 hap problems separately? --- data fragmentation

Co-clustering … solve 4 coupled hap problems jointly

Page 30: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Why humans are so similar and polymorphic patterns are regional

Population bottleneck: a small population that interbred reduced the genetic variation

Out of Africa ~ 100,000 years ago

Out of Africa

Page 31: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Each population can be associated with a unique DP capturing population-specific genetic demography

Different population may have unique haplotypes

Different population may share

common haplotypes

Thus Population specific DPs

are marginally dependent

Population Specific DPs

GG11

GG22

GG44

GG33

Page 32: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Hierarchical DP Mixture

GG11

GG22

GG44

GG33

....

HH

2

1

3

4

Page 33: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Simulated Data

As 1 HapTyping problem

As 5 HapTyping problems

Page 34: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

HapMap Data

Page 35: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The block structure of haplotypes

The Daly et al (2001) data set This consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region

implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European derived population, giving 258 transmitted and 258 untransmitted chromosomes.

The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.

Page 36: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Another study: the Patil et al data

The haplotype patterns for 20 independent globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major, yellow = minor allele. Each column represents a single

chromosome.

The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP

block of 26 SNPs over 19kb of genomic DNA. The 4 most common of 7 different haplotypes

include 80% of the chromosomes, and can be distinguished with 2 SNPs.

Figure 2 of Patil et al

Page 37: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Haplotype Blocks

Page 38: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Markov Dependence Between Blocks

On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.

Page 39: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Block Partitioning Algorithms

Greedy algorithm (Patil et al 2001). Begin by considering all possible blocks of ≥1 consecutive SNPs. Next, exclude all blocks in which < 80% of the chromosomes in the data are defined

by haplotypes represented more than once in the block (80% coverage). Considering the remaining overlapping blocks simultaneously, select the one which

maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block. Any of the remaining blocks that physically overlap with the selected block are discarded, and the process repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 2ith no gaps and with every SNP assigned to a block.

Hidden Markov Models Maximum a posterior inference (i.e., viterbi) (Daly et al. 2001) Minimum description length (Anderson et al.2003)

Dynamic programming (Sun et al. 2002)

Page 40: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Haplotypes are Shaped by Recombination

Inheritance of unknown generation

....

Page 41: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Hidden Markov DP for Recombination

HH

y1 y2 y3 yN…

....

HH

y1y1 y2y2 y3y3 yNyN…

....

Page 42: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Reference

E.P. Xing, R. Sharan and M.I Jordan, Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML2004),

N Patil et al . Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21 Science 294 2001:1719-1723.

M J Daly et al . High-resolution haplotype structure in the human genome Nat. Genet. 29 2001: 229-232

Anderson, E.C., Novembre, J. (2003) "Finding haplotype block boundaries using the minimum description length principle." American Journal of Human Genetics 73(2):336-354.