guenomu
Software and Model
Leonardo de O. Martins
University of Vigo
May 16th, 2013
Leo Martins (U Vigo) guenomu software 2013/5/16 1 / 15
Outline
1 The Model
2 The Sampling
3 The Code
Hierarchical Bayesian model
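\[
P(S, \Theta \mid D) \propto P(\theta_0)\, P(\vec{\lambda}_0)\, P(\alpha_0)\, P(S) \times
\prod_{i=1}^{N} P(D_i \mid G_i, \vec{\theta}_i)\, P(\vec{\theta}_i \mid \theta_0)\, P(G_i \mid \vec{\lambda}_i, \vec{w}_i, S)\, P(\vec{\lambda}_i \mid \vec{\lambda}_0)\, P(\vec{w}_i \mid \alpha_i)\, P(\alpha_i \mid \alpha_0)
\]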
The mixture of distance distributions
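\[
P(G \mid \vec{\lambda}, \vec{w}, S) =
\frac{w_1\, e^{-\left(d_{DUPS}(G,S)/\lambda_{DUPS} + d_{LOSS}(G,S)/\lambda_{LOSS}\right)}
    + w_2\, e^{-d_{ILS}(G,S)/\lambda_{ILS}}
    + w_3\, e^{-d_{RF}(G,S)/\lambda_{RF}}}
     {Z(\vec{\lambda}, \vec{w}, S)}
\]
\[
w_i \sim \mathrm{Gamma}(\alpha_{gene}, 1) \qquad \lambda_x \sim \mathrm{Exp}(\Lambda_x)
\]
each gene has its own set of $w_i$ and $\lambda_x$
the distances $d_x(G, S)$ are scaled to account for different gene family sizes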
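The numerator of the mixture can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the guenomu code: the function names and the dictionary layout are hypothetical, the distances are taken as already scaled, $\Lambda_x$ is assumed to be a rate, and the normalizing constant $Z$ is deliberately not computed.

```python
import math
import random

COMPONENTS = ("DUPS", "LOSS", "ILS", "RF")

def mixture_numerator(d, lam, w):
    """Unnormalized P(G | lambda, w, S): the numerator of the mixture.

    d and lam are dicts keyed by COMPONENTS (distances assumed already scaled);
    w holds the three mixture weights (w1, w2, w3).
    """
    duploss = math.exp(-(d["DUPS"] / lam["DUPS"] + d["LOSS"] / lam["LOSS"]))
    ils = math.exp(-d["ILS"] / lam["ILS"])
    rf = math.exp(-d["RF"] / lam["RF"])
    return w[0] * duploss + w[1] * ils + w[2] * rf

def draw_gene_parameters(alpha_gene, big_lambda, rng):
    """Draw one gene's weights and scales from their priors.

    Assumption: big_lambda[x] is the rate of the exponential prior on lambda_x.
    """
    w = tuple(rng.gammavariate(alpha_gene, 1.0) for _ in range(3))
    lam = {x: rng.expovariate(big_lambda[x]) for x in COMPONENTS}
    return w, lam
```

For example, `draw_gene_parameters(1.0, {x: 1.0 for x in COMPONENTS}, random.Random(1))` returns one gene's `(w, lam)` pair, which can be passed to `mixture_numerator` together with that gene's distances to the species tree.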
Doubly-intractable distributions
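\[
\pi(y \mid \theta) = \frac{q_\theta(y)}{Z(\theta)} = \frac{e^{\theta^{t} s(y)}}{Z(\theta)};
\qquad Z(\theta) = \sum_{y} e^{\theta^{t} s(y)} \tag{1}
\]
augmented distribution: $\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\, \pi(\theta)\, h(\theta' \mid \theta)\, \pi(y' \mid \theta')$
Gibbs update of the auxiliary variables $\theta', y'$:
I. draw $\theta' \sim h(\cdot \mid \theta)$
II. draw $y' \sim \pi(\cdot \mid \theta')$
exchange ratio from $\theta$ to $\theta'$:
\[
\min\left\{1,\;
\frac{q_{\theta}(y')\, \pi(\theta')\, h(\theta \mid \theta')\, q_{\theta'}(y)}
     {q_{\theta}(y)\, \pi(\theta)\, h(\theta' \mid \theta)\, q_{\theta'}(y')}
\right\} \tag{2}
\]
We draw $y'$ (the gene tree) through a secondary MCMC starting at its current value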
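A minimal Python sketch of one exchange update on a toy model may help to see why $Z(\theta)$ never appears in ratio (2). Everything here is an assumption made for illustration: $y$ lives on $\{0, \dots, k\}$ with $s(y) = y$, the prior on $\theta$ is standard normal, and $h$ is a symmetric Gaussian random walk, so the $h$ terms cancel. The auxiliary $y'$ is drawn exactly because the toy space is tiny; for gene trees that draw would be replaced by the secondary MCMC.

```python
import math
import random

def log_q(theta, y):
    """Unnormalized log-likelihood log q_theta(y) = theta * s(y), with s(y) = y."""
    return theta * y

def log_prior(theta):
    """Log of a standard normal prior on theta, up to an additive constant."""
    return -0.5 * theta * theta

def sample_y(theta, k_max, rng):
    """Draw y' ~ pi(. | theta) on {0, ..., k_max} (exact only for this toy space)."""
    support = range(k_max + 1)
    weights = [math.exp(log_q(theta, y)) for y in support]
    return rng.choices(support, weights=weights)[0]

def exchange_step(theta, y, k_max, rng, step=0.5):
    """One exchange update of theta given data y; returns (theta, accepted)."""
    theta_new = theta + rng.gauss(0.0, step)      # I.  theta' ~ h(. | theta)
    y_aux = sample_y(theta_new, k_max, rng)       # II. y' ~ pi(. | theta')
    # Log of ratio (2); h is symmetric, so h(theta | theta') / h(theta' | theta) = 1.
    log_ratio = (log_q(theta, y_aux) + log_prior(theta_new) + log_q(theta_new, y)
                 - log_q(theta, y) - log_prior(theta) - log_q(theta_new, y_aux))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return theta_new, True
    return theta, False
```

Only the unnormalized $q$ is ever evaluated: the two normalizing constants enter once in the numerator and once in the denominator of (2) and cancel.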
Species tree proposal with the exchange algorithm
Generalized Multiple-Try Metropolis
MH: sample $y$, then accept it with probability $\min(1, r)$
\[
r = \frac{\pi(y)}{\pi(x)} \frac{q(y, x)}{q(x, y)} = \frac{\pi(y)}{\pi(x)} \frac{p(x \mid y)}{p(y \mid x)}
\]
MTM: choose $y$ among several samples, according to their relative weights
\[
r = \frac{w(y_1, x) + \cdots + w(y_k, x)}{w(x^*_1, y) + \cdots + w(x^*_k, y)}
\]
where $w(x, y) = \pi(x)\, q(x, y)\, \lambda(x, y) = \pi(x)\, p(y \mid x)\, \lambda(x, y)$
GMTM: the weights $w(\cdot)$ do not need to represent probability distributions.
\[
r = \frac{\pi(y)\, p_k(x \mid y)}{\pi(x)\, p_k(y \mid x)} \frac{W_x}{W_y}
\]
where $W_y = \dfrac{w_i(y_i, x)}{\sum_{j=1}^{k} w_j(y_j, x)}$ for the chosen element $i$
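The MTM ratio can be sketched in Python for a one-dimensional target. This is a toy illustration, not the gene tree proposal itself: the target is an unnormalized standard normal, the proposal is a symmetric Gaussian random walk, and $\lambda(x, y) = 1/q(x, y)$ is assumed, so that the weights reduce to $w(y_j, x) = \pi(y_j)$. The function names are hypothetical.

```python
import math
import random

def log_target(x):
    """Unnormalized log pi(x): a standard normal, used only as a toy target."""
    return -0.5 * x * x

def mtm_step(x, k, rng, step=1.0):
    """One multiple-try Metropolis update; returns (new state, accepted)."""
    # Draw k candidates from the symmetric proposal around x and weight them.
    ys = [x + rng.gauss(0.0, step) for _ in range(k)]
    w_y = [math.exp(log_target(y)) for y in ys]
    # Choose y among the candidates according to their relative weights.
    y = rng.choices(ys, weights=w_y)[0]
    # Reference set: k - 1 draws around y, plus the current state x as x*_k.
    refs = [y + rng.gauss(0.0, step) for _ in range(k - 1)] + [x]
    w_x = [math.exp(log_target(z)) for z in refs]
    r = sum(w_y) / sum(w_x)
    if rng.random() < min(1.0, r):
        return y, True
    return x, False
```

With `k = 1` the reference set is just `[x]` and the ratio collapses to $\pi(y)/\pi(x)$, the MH ratio for a symmetric proposal.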
gene tree proposal with GMTM or MTM
RF distance, Assignment cost (Hdist)
A parallel pseudo-random number generator (PRNG)
Given a seed and an algorithm, we have a stream of PRNs.
[Figure: diagram relating the seed, PRNG1, the PRNG2 streams, and the values x1, x2, x3, x4, x11, x12]
Using a second algorithm, the first stream will give us a sequence of seeds. We use the 150 parameter sets for the Tausworthe (LFSR) generators (L'Ecuyer, Maths Comput 1999, pp. 261).
Therefore, given the seed, we can predict all states of all streams.
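The idea of one stream seeding the others can be sketched in Python. This is an illustration only: Python's built-in Mersenne Twister (`random.Random`) stands in for both generators, whereas the slides use Tausworthe (LFSR) generators with tabulated parameter sets, and `make_streams` is a hypothetical name.

```python
import random

def make_streams(seed, n_streams):
    """Build n_streams generators whose seeds all come from one master stream."""
    master = random.Random(seed)                            # plays the role of PRNG1
    seeds = [master.getrandbits(32) for _ in range(n_streams)]
    return [random.Random(s) for s in seeds]                # one PRNG2-like stream per seed
```

Calling `make_streams(42, 4)` twice yields streams that produce identical sequences, which is the reproducibility property stated above: the single seed determines every state of every stream.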
A parallel pseudo-random number generator (PRNG)
In our gene/species model:
we split gene families among jobs
all jobs receive the seed (broadcast) and therefore can reproduce the same x1. That's cheaper than communicating the states.
each job uses its own x(i+1) for sampling new gene trees etc. and can work in parallel. They use the common x1 for sampling e.g. a new species tree, which needs synchronization.
the only thing that must be shared is thus the proposal values (AllReduce) when updating "global" parameters, so that all jobs can make the same acceptance/rejection decision.
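The scheme above can be mimicked in a single Python process, without MPI. This is a sketch under assumptions: `random.Random` again stands in for the real generators, a plain `sum` stands in for the AllReduce, each job is assumed to contribute a local log-ratio for its own gene families, and all names are hypothetical.

```python
import math
import random

def job_streams(seed, rank):
    """Rebuild, inside job `rank`, the common stream x1 and its private stream x(rank+2)."""
    master = random.Random(seed)                    # every job receives the same seed
    seeds = [master.getrandbits(32) for _ in range(rank + 2)]
    common = random.Random(seeds[0])                # x1: identical in all jobs
    private = random.Random(seeds[rank + 1])        # used for this job's gene trees
    return common, private

def global_decisions(seed, local_log_ratios):
    """Accept/reject decision for a global parameter, as computed by every job."""
    total = sum(local_log_ratios)                   # stand-in for AllReduce
    decisions = []
    for rank in range(len(local_log_ratios)):
        common, _private = job_streams(seed, rank)
        u = common.random()                         # same draw in every job
        decisions.append(u < math.exp(min(0.0, total)))
    return decisions
```

Because every job draws the same uniform number from the common stream and sees the same reduced total, `global_decisions(7, [-0.1, 0.2, -0.3, 0.05])` returns a list of identical booleans; only the local contributions had to be communicated.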
Each job looks like an independent analysis
https://bitbucket.org/leomrtns/guenomu