18.1 Log-likelihood Gradient - University at Buffalo, CSE676


Page 1:

The Log-likelihood Gradient

Sargur N. Srihari, srihari@cedar.buffalo.edu

Page 2:

Topics
• Definition of the Partition Function
1. The log-likelihood gradient
2. Stochastic maximum likelihood and contrastive divergence
3. Pseudolikelihood
4. Score matching and Ratio matching
5. Denoising score matching
6. Noise-contrastive estimation
7. Estimating the partition function


Page 3:

Undirected models in deep learning
• p_model(x) is an undirected model

• We study how the parameters are to be determined

[Figures: a deep Boltzmann machine; a restricted Boltzmann machine]

Page 4:

Finding the most likely parameters θ

• Task of interest:
  – Determine the parameters θ of a Gibbs distribution
    $$p(\mathbf{x};\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(\mathbf{x};\theta)$$
  – where
    $$Z(\theta) = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x};\theta)$$
    is the partition function
• Learning an undirected model by MLE is difficult because the partition function depends on the parameters
• First, recall the maximum likelihood principle
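To see why the dependence of Z(θ) on θ is the obstacle, here is a minimal NumPy sketch, not from the slides, that evaluates Z(θ) by brute-force enumeration for a tiny fully observed binary model. The specific energy function, sizes, and names are illustrative assumptions; the point is that the sum over all configurations costs 2^n, which is exactly what becomes intractable for realistic models.

import numpy as np
from itertools import product

def log_p_tilde(x, W, b):
    """Unnormalized log-probability log p~(x; theta) of a small binary
    pairwise (Boltzmann-style) model: -E(x) = 0.5 * x^T W x + b^T x."""
    return 0.5 * x @ W @ x + b @ x

def partition_function(W, b):
    """Z(theta) = sum over all 2^n binary configurations of p~(x; theta).
    Feasible only for tiny n; the cost grows as 2^n."""
    n = len(b)
    states = np.array(list(product([0, 1], repeat=n)), dtype=float)
    return sum(np.exp(log_p_tilde(x, W, b)) for x in states)

rng = np.random.default_rng(0)
n = 6                                   # 2^6 = 64 states: easy to enumerate
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                       # symmetric weights, illustrative
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.1, size=n)

Z = partition_function(W, b)            # re-evaluating Z is needed whenever theta changes
x = rng.integers(0, 2, size=n).astype(float)
print("log p(x; theta) =", log_p_tilde(x, W, b) - np.log(Z))

Because Z(θ) changes whenever W or b changes, every gradient step on the likelihood would in principle require this exponential sum again, which motivates the rest of the chapter.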

Page 5:

Maximum Likelihood Expression
• Given m i.i.d. examples X = {x^(1), x^(2), ..., x^(m)}
  – drawn from the true but unknown distribution p_data(x)
• Let p_model(x; θ) be a parametric model indexed by θ
  – i.e., p_model(x; θ) maps any x to an estimate of the true probability p_data(x)
  – The MLE for θ maximizes the likelihood of the data
• Equivalently, by taking logarithms, it maximizes the sum of log-likelihoods
• Replacing the summation with an expectation gives an equivalent objective
• This is solved using gradient ascent, where g is the gradient with terms:

$$\theta \leftarrow \theta + \epsilon g$$
$$g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)$$
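For reference, the standard chain of maximum-likelihood equivalences that the bullets above describe is the following; this is a reconstruction in the slides' notation, not a formula copied verbatim from them:

$$\theta_{ML} = \arg\max_\theta \prod_{i=1}^{m} p_{\text{model}}\!\left(x^{(i)};\theta\right) = \arg\max_\theta \sum_{i=1}^{m} \log p_{\text{model}}\!\left(x^{(i)};\theta\right) = \arg\max_\theta\, \mathbb{E}_{\mathbf{x}\sim\hat{p}_{\text{data}}} \log p_{\text{model}}(x;\theta)$$

Dividing the sum by m to obtain the expectation under the empirical distribution does not change the maximizer, which is why the two last forms are interchangeable.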

Page 6:

Gradient has two phases
• Positive and negative phases of learning
  – The gradient of the log-likelihood w.r.t. the parameters has a term corresponding to the gradient of the partition function


$$\nabla_\theta \log p(x;\theta) = \nabla_\theta \log \tilde{p}(x;\theta) - \nabla_\theta \log Z(\theta)$$
$$p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta)$$

Page 7:

Tractability: Positive and Negative Phases
• For most undirected models of interest, the negative phase, $\nabla_\theta \log Z(\theta)$, is difficult
• Models with no latent variables, or with few interactions between latent variables, have a tractable positive phase
  – RBM: straightforward positive phase, difficult negative phase
• This chapter is about the difficulties of the negative phase

Page 8:

Computing Gradient for Negative Phase

• For models that guarantee p(x) > 0 for all x, we can substitute $\exp(\log\tilde{p}(x))$ for $\tilde{p}(x)$

– The derivation uses summation over discrete x
– A similar result applies using integration over continuous x
– In the continuous version, we use the Leibniz rule for differentiation under the integral sign
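The derivation itself is not shown here; for the discrete case it is the standard argument (a reconstruction, not verbatim from the slides), using p(x) > 0 so that p̃(x) = exp(log p̃(x)) and exchanging the gradient with the sum:

$$\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta Z(\theta)}{Z(\theta)} = \frac{1}{Z(\theta)}\nabla_\theta \sum_x \tilde{p}(x) = \frac{1}{Z(\theta)}\sum_x \nabla_\theta \exp\!\big(\log \tilde{p}(x)\big) = \sum_x \frac{\tilde{p}(x)}{Z(\theta)}\,\nabla_\theta \log \tilde{p}(x) = \sum_x p(x)\,\nabla_\theta \log \tilde{p}(x) = \mathbb{E}_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)$$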

Page 9:

MLE for RBM

For an RBM, x = {v, h}: binary visible units v with biases a, binary hidden units h with biases b, and connection weights W; the parameters are θ = {W, a, b}.

[Figure: RBM with visible units v (biases a_i), hidden units h (biases b_j), and connections W between them.]

Energy function:
$$E(v,h) = -h^{\top}Wv - a^{\top}v - b^{\top}h = -\sum_i \sum_j W_{i,j}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$$

Probability distribution of the undirected model (Gibbs):
$$p(v,h) = \frac{1}{Z}\exp\big(-E(v,h)\big), \qquad \tilde{p}(v,h) = \exp\big(-E(v,h)\big), \qquad Z = \sum_{v,h} \exp\big(-E(v,h)\big)$$
The partition function Z is intractable.

Determine the parameters θ that maximize the log-likelihood (the negative of the loss):
$$\max_\theta L(\{x^{(1)},\ldots,x^{(M)}\};\theta) = \sum_m \log p(x^{(m)};\theta) = \sum_m \log \tilde{p}(x^{(m)};\theta) - \sum_m \log Z(\theta)$$

For stochastic gradient ascent, take derivatives and update θ ← θ + εg, where
$$g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)$$

Derivative of the positive phase:
$$\frac{1}{M}\nabla_\theta \sum_{m=1}^{M} \log \tilde{p}(x^{(m)};\theta)$$
The summation is over samples from the training set; since the gradient is summed over the M examples, the 1/M factor just averages and does not change the solution.

Derivative of the negative phase, via the identity
$$\nabla_\theta \log Z(\theta) = \mathbb{E}_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x) \approx \frac{1}{M}\nabla_\theta \sum_{m=1}^{M} \log \tilde{p}(x^{(m)};\theta),$$
where the summation is over samples drawn from the RBM.

The derivative of the energy with respect to a weight is:
$$\frac{\partial}{\partial W_{i,j}} E(v,h) = -v_i h_j$$
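To make the positive/negative phase structure concrete, here is a minimal NumPy sketch, not from the slides, of the log-likelihood gradient for the weights of a binary RBM. The positive phase averages v_i h_j over training data; the negative phase averages it over model samples, approximated here by a few steps of block Gibbs sampling, which anticipates the stochastic maximum likelihood / contrastive divergence methods of the later sections. All function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i W_ij v_i) for a binary RBM."""
    p = sigmoid(v @ W + b)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, a):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j)."""
    p = sigmoid(h @ W.T + a)
    return p, (rng.random(p.shape) < p).astype(float)

def weight_gradient(v_data, W, a, b, k=5):
    """Positive phase: average of v_i h_j over the training batch (h marginalized).
    Negative phase: average of v_i h_j over approximate model samples, obtained
    with k steps of block Gibbs sampling started from the data."""
    ph_data, _ = sample_h_given_v(v_data, W, b)
    positive = v_data.T @ ph_data / len(v_data)     # <v_i h_j> under the data

    v = v_data.copy()
    for _ in range(k):                              # block Gibbs chain
        _, h = sample_h_given_v(v, W, b)
        _, v = sample_v_given_h(h, W, a)
    ph_model, _ = sample_h_given_v(v, W, b)
    negative = v.T @ ph_model / len(v)              # <v_i h_j> under the model (approx.)

    return positive - negative                      # dL/dW: add it for gradient ascent

# Illustrative usage on random binary "data"
n_visible, n_hidden, batch = 6, 4, 32
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
v_batch = (rng.random((batch, n_visible)) < 0.5).astype(float)
W += 0.1 * weight_gradient(v_batch, W, a, b)        # one gradient-ascent step on W

The positive-phase term is cheap because p(h | v) factorizes in an RBM; the cost is entirely in the negative phase, where exact samples from the model are unavailable and must be approximated.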

Page 10:

Monte Carlo methods

• The identity
  $$\nabla_\theta \log Z(\theta) = \mathbb{E}_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)$$
  is the basis for Monte Carlo methods for maximum likelihood estimation of models with intractable partition functions
• The MC approach provides an intuitive framework for both the positive and negative phases
  – Positive phase: increase $\log \tilde{p}(x)$ for x drawn from the data
  – Negative phase: decrease the partition function by decreasing $\log \tilde{p}(x)$ for x drawn from the model distribution
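As a compact summary consistent with the slides (a restatement, not additional slide material), the Monte Carlo view replaces both expectations with sample averages, giving the gradient estimate

$$\frac{1}{M}\sum_{m=1}^{M}\nabla_\theta \log p\!\left(x^{(m)};\theta\right) \approx \frac{1}{M}\sum_{m=1}^{M}\nabla_\theta \log \tilde{p}\!\left(x^{(m)};\theta\right) - \frac{1}{K}\sum_{k=1}^{K}\nabla_\theta \log \tilde{p}\!\left(\tilde{x}^{(k)};\theta\right),$$

where the x^(m) are drawn from the training data (positive phase) and the x̃^(k) are drawn from the model distribution (negative phase).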