advanced algorithms and models for computational biology -- a machine learning approach
DESCRIPTION
Advanced Algorithms and Models for Computational Biology -- a machine learning approach. Population Genetics: More on halpltypes --- coalescence, multi-population phasing, blocks, etc. Eric Xing Lecture 19, March 27, 2006. Clustering. How to label them ? inference How many clusters ??? - PowerPoint PPT PresentationTRANSCRIPT
Advanced Algorithms Advanced Algorithms and Models for and Models for
Computational BiologyComputational Biology-- a machine learning approach-- a machine learning approach
Population Genetics:Population Genetics:
More on halpltypesMore on halpltypes--- coalescence, multi-population phasing, blocks, --- coalescence, multi-population phasing, blocks,
etc.etc.
Eric XingEric Xing
Lecture 19, March 27, 2006
Clustering
How to label them ? inference
How many clusters ??? model selection ? or inference ?
Genetic Demography
Are there genetic prototypes among them ? What are they ? How many ? (how many ancestors do we have ?)
Inference done separately, or jointly?
Multi-population Genetic Demography
Clustering as Mixture Modeling
Model selection "intelligent" guess: ??? cross validation: data-hungry information theoretic:
AIC TIC MDL :
Posterior inference:
we want to handle uncertainty of model complexity explicitly
we favor a distribution that does not constrain M in a "closed" space!
Model Selection vs. Posterior Inference
),ˆ|(|)(minarg KKL MLgf
)()|()|( MpMDpDMp
K,M
Parsimony, Ockam's RazorParsimony, Ockam's Razor
C T A
T G A
C G A
T T A
??????
haplotype h(h1, h2)possible associations of alleles to chromosomes
Heterozygousdiploid individual
C T A
T G ACp
Cm
Genotype gpairs of alleles, whose
associations to chromosomes are unknown
ATGCsequencing
TC TG AA
This is a mixture modeling problem!
Haplotype Ambiguity
The probability of a genotype g:
Standard settings: H| = K << 2J fixed-sized population haplotype pool
p(h1,h2)= p(h1)p(h2)=f1f2 Hardy-Weinberg equilibrium
Problem: K ? H ?
,
2121
21
),|(),()(Hhh
hhgphhpgp
Genotypingmodel
Haplotypemodel
Population haplotypepool
A Finite (Mixture of ) Allele Model
Gn
Hn1 Hn2
Present
Time
22 individuals
The coalescent process
Present
Time
22 individuals
18 ancestors
The coalescent process
Present
Time
22 individuals
18 ancestors
16 ancestors
The coalescent process
Present
Time
22 individuals
18 ancestors
16 ancestors
14 ancestors
The coalescent process
Present
Time
22 individuals
18 ancestors
16 ancestors
14 ancestors
12 ancestors
9 ancestors
8 ancestors
8 ancestors
7 ancestors
7 ancestors
5 ancestors
5 ancestors
3 ancestors
3 ancestors
3 ancestors
2 ancestors
2 ancestors
1 ancestor
Present
Time
Most recent common ancestor(MRCA)
Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)
Present
Time
Most recent common ancestor(MRCA)
TCGAGGTATTAACTCTAGGTATTAACTCGAGGCATTAACTCTAGGTGTTAACTCGAGGTATTAGCTCTAGGTATCAAC * ** * *
Present
Time
TCGAGGTATTAACTCTAGGTATTAACTCGAGGCATTAACTCTAGGTGTTAACTCGAGGTATTAGCTCTAGGTATCAAC * ** * *
Assuming there are presently k active lineages: The probability of coalescence:
The probability of mutation (killing a lineage):
11
kk
1k
Population Genetic Basis of an Infinite Allele model
≈
Natural genealogy
N
Infinite mixturesCoalescent with mutation
Kingman coalescent process with binary lineage merging Kingman coalescent process with binary lineage merging New population haplotype alleles emerge along all branches of the New population haplotype alleles emerge along all branches of the
coalescence tree at rate coalescence tree at rate /2 per unit length/2 per unit length
The Ewens Sampling Formula: an exchangeable random partition of The Ewens Sampling Formula: an exchangeable random partition of individualsindividuals
Dirichlet process mixtureDirichlet process mixture
A Hierarchical Bayesian Infinite Allele model
).},{|(~ kahph
• Assume anAssume an individual haplotypeindividual haplotype hh is stochastically is stochastically derived from aderived from a population haplotypepopulation haplotype ak withwith
nucleotide-substitution frequencynucleotide-substitution frequency k: :
• Not knowing the correspondences between individual Not knowing the correspondences between individual and population haplotypes, each individual haplotype and population haplotypes, each individual haplotype is a mixture of population haplotypesis a mixture of population haplotypes..
• The number and identity of the population haplotypes are unknownThe number and identity of the population haplotypes are unknown
use ause a Dirichlet Process Dirichlet Process to construct a priorto construct a prior distributiondistribution GG on on HH××RRJJ..
Gn
Hn1 Hn2
Ak k
G
G0
Dirichlet Process Priors
A c.d.f., G onfollows a Dirichlet Process if for any measurable finite partition of (B1,B2, .., Bm), of , the joint distribution of the random variables
(G(B1), G(B2), …, G(Bm)) ~ Dirichlet(G0(B1), …., G0(Bm)),
where, G0 is a the base distribution and a is the precision parameter
Stick-breaking Process
G0
0 0.4 0.4
0.6 0.5 0.3
0.3 0.8 0.24
),Beta(~
)-(
~
)(
∏
∑
∑
-
∞
∞
1
1
1
1
1
1
0
1
k
k
jkkk
kk
k
kkk
G
G
Location
Mass
Chinese Restaurant Process
CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers
=)|=( -ii kcP c 1 0 0
+1
10
+1
+2
1
+2
1
+2
+3
1
+3
2
+3
1-+1
i
m
1-+2
i
m
1-+
i....
1 2
{A} {A} {A} {A} {A} {A} ……3
12 4
56 7
8 9
The DP Mixture of Ancestral Haplotypes
The customers around a table form a cluster associate a mixture component (i.e., a population haplotype) with a table
sample {a, } at each table from a base measure G0 to obtain the population haplotype and nucleotide substitution frequency for that component
With p(h|{}) and p(g|h1,h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)
DP-haplotyper
Inference: Markov Chain Monte Carlo (MCMC) Gibbs sampling Metropolis Hasting
Gn
Hn1 Hn2
A
N
K
G
G0 DP
infinite mixture components(for population haplotypes)
Likelihood model(for individual
haplotypes and genotypes)
Model components
Choice of base measure:
Nucleotide-substitution model:
Noisy genotyping model:
j
jaG )Beta()Unif(~ 0
jkjijk
jkjijkjkjkji
jjkjkjiki
ah
ahahp
ahpahp
,,,
,,,
,,,
,,,
if
if ),|( where
),|()},{|(
1
jijiji
jijiji
jijiji
jjijijiiii
ghh
ghhhhgp
hhgphhgp
,,,
,,,
,,,
,,,
if 2
if ),|( where
),|(),|(
21
21
21
2121
1
Gibbs sampling
Starting from some initial haplotype reconstruction H(0) , pick a first table with an arbitrary a1
(0) , and form initial population-hap pool A(0) ={a1
(0) }:
i) Choose an individual i and one of his/her two haplytopes t, uniformly and at random, from all ambiguous individuals;
ii) Sample from , update ;
iii) Sample , where , from ;
update A(t+1) ;
iii) Sample from , update H(t+1).
),,|( )()()()1( ttti
ti Hccp
ttA
)1( tit
c
)1( tka )1( t
itck ) s.t. |( )1(
')('
)1(
''kchap t
iti
tk tt
)1( tit
h ),,|( )1()()1()1(
tti
ti
ti ttt
Hchp A
)1( tc
Convergence of Ancestral Inference
Haplotyping Error
The Gabriel data
Multi-population Genetic Gemography
Pool everything together and solve 1 hap problem? --- ignore population structures
Solve 4 hap problems separately? --- data fragmentation
Co-clustering … solve 4 coupled hap problems jointly
Why humans are so similar and polymorphic patterns are regional
Population bottleneck: a small population that interbred reduced the genetic variation
Out of Africa ~ 100,000 years ago
Out of Africa
Each population can be associated with a unique DP capturing population-specific genetic demography
Different population may have unique haplotypes
Different population may share
common haplotypes
Thus Population specific DPs
are marginally dependent
Population Specific DPs
GG11
GG22
GG44
GG33
Hierarchical DP Mixture
GG11
GG22
GG44
GG33
....
HH
2
1
3
4
Simulated Data
As 1 HapTyping problem
As 5 HapTyping problems
HapMap Data
The block structure of haplotypes
The Daly et al (2001) data set This consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region
implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European derived population, giving 258 transmitted and 258 untransmitted chromosomes.
The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.
Another study: the Patil et al data
The haplotype patterns for 20 independent globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major, yellow = minor allele. Each column represents a single
chromosome.
The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP
block of 26 SNPs over 19kb of genomic DNA. The 4 most common of 7 different haplotypes
include 80% of the chromosomes, and can be distinguished with 2 SNPs.
Figure 2 of Patil et al
Haplotype Blocks
Markov Dependence Between Blocks
On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.
Block Partitioning Algorithms
Greedy algorithm (Patil et al 2001). Begin by considering all possible blocks of ≥1 consecutive SNPs. Next, exclude all blocks in which < 80% of the chromosomes in the data are defined
by haplotypes represented more than once in the block (80% coverage). Considering the remaining overlapping blocks simultaneously, select the one which
maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block. Any of the remaining blocks that physically overlap with the selected block are discarded, and the process repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 2ith no gaps and with every SNP assigned to a block.
Hidden Markov Models Maximum a posterior inference (i.e., viterbi) (Daly et al. 2001) Minimum description length (Anderson et al.2003)
Dynamic programming (Sun et al. 2002)
Haplotypes are Shaped by Recombination
Inheritance of unknown generation
....
Hidden Markov DP for Recombination
HH
y1 y2 y3 yN…
....
HH
y1y1 y2y2 y3y3 yNyN…
....
Reference
E.P. Xing, R. Sharan and M.I Jordan, Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML2004),
N Patil et al . Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21 Science 294 2001:1719-1723.
M J Daly et al . High-resolution haplotype structure in the human genome Nat. Genet. 29 2001: 229-232
Anderson, E.C., Novembre, J. (2003) "Finding haplotype block boundaries using the minimum description length principle." American Journal of Human Genetics 73(2):336-354.