18.1 The Log-Likelihood Gradient
University at Buffalo, CSE676
Deep Learning Srihari
Topics
• Definition of the Partition Function
  1. The log-likelihood gradient
  2. Stochastic maximum likelihood and contrastive divergence
  3. Pseudolikelihood
  4. Score matching and ratio matching
  5. Denoising score matching
  6. Noise-contrastive estimation
  7. Estimating the partition function
Undirected models in deep learning
• p_model(x) is an undirected model
• We study how its parameters are to be determined
[Figure: a deep Boltzmann machine and a restricted Boltzmann machine]
Finding most likely parameters θ
• Task of interest:
  – Determine the parameters θ of a Gibbs distribution
      p(x;θ) = (1/Z(θ)) p̃(x;θ)
    where Z(θ) = Σ_x p̃(x;θ) is the partition function
• Learning an undirected model by MLE is difficult because the partition function depends on the parameters
• First recall the maximum likelihood principle
Maximum Likelihood Expression
• Given m i.i.d. examples X = {x^(1), x^(2), ..., x^(m)}
  – drawn from the true but unknown distribution p_data(x)
• Let p_model(x;θ) be a parametric family indexed by θ
  – i.e., p_model(x;θ) maps any x to a real number estimating the true probability p_data(x)
• The MLE for θ is:
    θ_ML = argmax_θ Π_m p_model(x^(m);θ)
• Equivalently, by taking logarithms:
    θ_ML = argmax_θ Σ_m log p_model(x^(m);θ)
• Replacing the summation with an expectation over the empirical distribution:
    θ_ML = argmax_θ E_{x~p̂_data} log p_model(x;θ)
• This is solved using gradient ascent: θ ← θ + εg
• where g is the gradient, with one term for the unnormalized distribution and one for the partition function:
    g = ∇θ log p(x^(m);θ) = ∇θ log p̃(x^(m);θ) − ∇θ log Z(θ)
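As a numerical sanity check (not from the slides), the decomposition log p = log p̃ − log Z can be verified on a toy model small enough to compute Z(θ) by enumeration. The exponential-family form p̃(x;θ) = exp(θᵀx) and the parameter values below are illustrative assumptions:

```python
import numpy as np

# Toy unnormalized model over x ∈ {0,1}^2 (the form p̃(x;θ) = exp(θᵀx) is assumed)
theta = np.array([0.5, -1.0])
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def log_p_tilde(x, theta):
    """Unnormalized log-probability: log p̃(x;θ) = θᵀx."""
    return x @ theta

# Partition function by explicit enumeration: Z(θ) = Σ_x p̃(x;θ)
log_Z = np.log(np.sum(np.exp(log_p_tilde(states, theta))))

# Normalized log-probability via the decomposition log p = log p̃ − log Z
log_p = log_p_tilde(states, theta) - log_Z

# The normalized probabilities must sum to one
assert np.isclose(np.exp(log_p).sum(), 1.0)
```

For larger models this enumeration is exactly what becomes intractable: Z(θ) sums over exponentially many configurations, which is why the ∇θ log Z(θ) term dominates the difficulty of learning.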
Gradient has two phases
• Positive and negative phases of learning
  – The gradient of the log-likelihood w.r.t. the parameters has a term corresponding to the gradient of the partition function:
      ∇θ log p(x;θ) = ∇θ log p̃(x;θ) − ∇θ log Z(θ)
    since p(x;θ) = (1/Z(θ)) p̃(x;θ)
Tractability: positive and negative phases
• For most undirected models, the negative phase ∇θ log Z(θ) is difficult
• Models with no latent variables, or with few interactions between latent variables, have a tractable positive phase
  – RBM: straightforward positive phase, difficult negative phase
• This chapter is about difficulties with the negative phase
Computing Gradient for Negative Phase
• For models that guarantee p(x) > 0 for all x, we can substitute exp(log p̃(x)) for p̃(x):
    ∇θ log Z(θ) = ∇θ Z(θ) / Z(θ)
                = (1/Z(θ)) Σ_x ∇θ p̃(x)
                = (1/Z(θ)) Σ_x p̃(x) ∇θ log p̃(x)
                = Σ_x p(x) ∇θ log p̃(x)
                = E_{x~p(x)} ∇θ log p̃(x)
• The derivation used summation over discrete x
• A similar result applies using integration over continuous x; in the continuous version we use the Leibniz rule for differentiation under the integral sign
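The identity ∇θ log Z(θ) = E_{x~p(x)} ∇θ log p̃(x) can be checked numerically on a toy model small enough to enumerate. This is a sketch; the exponential-family form below is an illustrative assumption, chosen so that ∇θ log p̃(x) = x:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0])
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Assumed model: log p̃(x;θ) = θᵀx, hence ∇θ log p̃(x) = x
logits = states @ theta
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # exact p(x;θ) by enumeration

# Exact value of ∇θ log Z(θ) = E_{x~p}[∇θ log p̃(x)] = E_p[x]
exact = probs @ states

# Monte Carlo estimate: average ∇θ log p̃ over samples from the model
samples = states[rng.choice(len(states), size=100_000, p=probs)]
mc_estimate = samples.mean(axis=0)         # converges to `exact` as sample size grows
```

In a realistic undirected model the exact sampling step above is unavailable, so the expectation is approximated with MCMC samples from the model, which is the subject of the following sections.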
MLE for RBM
• Probability distribution of an undirected model (Gibbs):
    p(x;θ) = (1/Z(θ)) p̃(x;θ),   Z(θ) = Σ_x p̃(x;θ)
• For an RBM: x = {v, h} and θ = {W, a, b}
  [Figure: RBM with binary visible units v_i (biases a_i), binary hidden units h_j (biases b_j), and connection weights W]
• Energy function:
    E(v,h) = −h^T W v − a^T v − b^T h = −Σ_i Σ_j W_ij v_i h_j − Σ_i a_i v_i − Σ_j b_j h_j
• Gibbs distribution:
    p(v,h) = (1/Z) exp(−E(v,h)),  where p̃(v,h) = exp(−E(v,h)) and Z = Σ_{v,h} exp(−E(v,h))
• Determine the parameters θ that maximize the log-likelihood (the negative of the loss):
    max_θ L({x^(1), ..., x^(M)};θ) = Σ_m log p(x^(m);θ) = Σ_m log p̃(x^(m);θ) − Σ_m log Z(θ)
  – Z(θ) is the intractable partition function
• For stochastic gradient ascent θ ← θ + εg, take derivatives:
    g = ∇θ log p(x^(m);θ) = ∇θ log p̃(x^(m);θ) − ∇θ log Z(θ)
• Derivative of the positive phase:
    (1/M) Σ_{m=1}^{M} ∇θ log p̃(x^(m);θ)
  – Summation is over samples from the training set; since the objective is a sum of M terms, the 1/M factor has no effect on the maximizer
• Derivative of the negative phase:
    ∇θ log Z(θ) = E_{x~p(x)} ∇θ log p̃(x)
  – Summation is over samples from the RBM
• An identity:
    ∂E(v,h)/∂W_ij = −v_i h_j
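For a tiny RBM the quantities on this slide can be computed exactly by enumerating all (v,h) configurations. This is a sketch with illustrative weights, biases, and layer sizes; the positive-phase statistic uses the standard RBM conditional P(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ij):

```python
import numpy as np
from itertools import product

# Tiny RBM: 2 binary visible units, 2 binary hidden units (sizes assumed)
W = np.array([[0.2, -0.5],
              [0.3,  0.1]])     # W[i, j] connects v_i and h_j
a = np.array([0.1, -0.2])       # visible biases a_i
b = np.array([0.0,  0.4])       # hidden biases b_j

def energy(v, h):
    """E(v,h) = −Σ_ij W_ij v_i h_j − Σ_i a_i v_i − Σ_j b_j h_j."""
    return -(v @ W @ h) - a @ v - b @ h

configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=2)]

# Partition function by enumeration: Z = Σ_{v,h} exp(−E(v,h))
Z = sum(np.exp(-energy(v, h)) for v in configs for h in configs)

def p_h_given_v(v):
    """P(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ij), the RBM's factorizing conditional."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

# Positive-phase statistic for one data vector: E_{h|v}[v hᵀ]
# (since ∂E/∂W_ij = −v_i h_j, we get ∇_W log p̃(v) = E_{h|v}[v hᵀ])
v_data = np.array([1.0, 0.0])
positive = np.outer(v_data, p_h_given_v(v_data))

# Negative-phase statistic: E_{(v,h)~model}[v hᵀ]
negative = sum(np.exp(-energy(v, h)) / Z * np.outer(v, h)
               for v in configs for h in configs)

grad_W = positive - negative    # ∇_W log p(v_data) for this single example
```

For a realistically sized RBM the enumeration behind Z and the negative-phase expectation is impossible, so the negative phase must be approximated with samples from the RBM (e.g. via Gibbs sampling), while the positive phase stays exact.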
Monte Carlo methods
• The identity
    ∇θ log Z(θ) = E_{x~p(x)} ∇θ log p̃(x)
  is the basis for Monte Carlo methods for MLE of models with intractable partition functions
• The MC approach provides an intuitive framework for both the positive and negative phases:
  – Positive phase: increase log p̃(x) for x drawn from the data
  – Negative phase: decrease the partition function by decreasing log p̃(x) for x drawn from the model distribution
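The two phases can be sketched as a training loop on a toy model: the positive phase averages ∇θ log p̃ over training data, while the negative phase averages it over samples drawn from the current model. All values below are illustrative assumptions, and a real application would draw the negative-phase samples with MCMC rather than exact enumeration:

```python
import numpy as np

rng = np.random.default_rng(1)
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data = np.array([[1, 1], [1, 0], [1, 1], [0, 1]], dtype=float)  # assumed toy training set

def model_probs(theta):
    """Exact p(x;θ) for the toy model p̃(x;θ) = exp(θᵀx), by enumeration."""
    logits = states @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = np.zeros(2)
eps = 0.1
for _ in range(500):
    # Positive phase: raise log p̃ on samples from the training set
    positive = data.mean(axis=0)                      # ∇θ log p̃(x) = x here
    # Negative phase: lower log Z using samples from the model distribution
    idx = rng.choice(len(states), size=256, p=model_probs(theta))
    negative = states[idx].mean(axis=0)
    theta += eps * (positive - negative)              # θ ← θ + εg
```

Because this toy model is in the exponential family with sufficient statistic x, the fixed point of the update is moment matching: at convergence the model's expected statistics E_p[x] approach the data mean, which is exactly the balance of the positive and negative phases.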