
Lecture 16: Deep Neural Generative Models

CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

May 22, 2017

Approach so far:

We have considered simple models and then constructed their deep, non-linear variants

Example: PCA (and the linear autoencoder) to nonlinear PCA (non-linear, possibly deep, autoencoders)

Example: Sparse coding (a sparse autoencoder with linear decoding) to deep sparse autoencoders

All the models we have considered so far are completely deterministic

The encoders and decoders have no stochasticity

We don't construct a probabilistic model of the data

Can't sample from the model


Representations

[Figure: Ruslan Salakhutdinov]

To motivate Deep Neural Generative Models, like before, let's seek inspiration from simple linear models first

Linear Factor Model

We want to build a probabilistic model of the input, \tilde{P}(x)

Like before, we are interested in latent factors h that explain x

We then care about the marginal:

\tilde{P}(x) = \mathbb{E}_{h}\,\tilde{P}(x \mid h)

h is a representation of the data


Linear Factor Model

The latent factors h are an encoding of the data

Simplest decoding model: obtain x from a linear transformation of h, plus some noise

Formally: suppose we sample the latent factors from a distribution h \sim P(h)

Then: x = Wh + b + \varepsilon
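As an aside (not part of the original slides), here is a minimal NumPy sketch of this decoding model: sample h from a prior and decode x = Wh + b + ε. The Gaussian prior, the noise scale, and the dimensions are assumptions made only for the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 5, 3                        # observed and latent dimensions (arbitrary)

    W = rng.normal(size=(d, k))        # decoding weights
    b = rng.normal(size=d)             # offset
    sigma = 0.1                        # noise scale (isotropic, assumed for the sketch)

    h = rng.normal(size=k)             # h ~ P(h), taken here to be N(0, I)
    eps = sigma * rng.normal(size=d)   # noise
    x = W @ h + b + eps                # x = Wh + b + eps
    print(x)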


Linear Factor Model

P(h) is a factorial distribution

[Figure: directed graphical model with latent units h1, h2, h3 and observed units x1, ..., x5]

x = Wh + b + \varepsilon

How do we learn in such a model?

Let's look at a simple example

Probabilistic PCA

Suppose the underlying latent factor has a Gaussian distribution:

h \sim \mathcal{N}(h; 0, I)

For the noise model, assume \varepsilon \sim \mathcal{N}(0, \sigma^2 I)

Then:

P(x \mid h) = \mathcal{N}(x \mid Wh + b, \sigma^2 I)

We care about the marginal P(x) (predictive distribution):

P(x) = \mathcal{N}(x \mid b, WW^T + \sigma^2 I)
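A small numerical check (an illustration, not from the slides): data generated as x = Wh + b + ε with h ∼ N(0, I) and ε ∼ N(0, σ²I) should have covariance close to WWᵀ + σ²I.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, sigma, n = 4, 2, 0.5, 100_000

    W = rng.normal(size=(d, k))
    b = rng.normal(size=d)

    H = rng.normal(size=(n, k))                    # h ~ N(0, I)
    E = sigma * rng.normal(size=(n, d))            # eps ~ N(0, sigma^2 I)
    X = H @ W.T + b + E                            # x = Wh + b + eps

    emp_cov = np.cov(X, rowvar=False)              # empirical covariance of x
    model_cov = W @ W.T + sigma**2 * np.eye(d)     # W W^T + sigma^2 I
    print(np.abs(emp_cov - model_cov).max())       # should be small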


Probabilistic PCA

P(x) = \mathcal{N}(x \mid b, WW^T + \sigma^2 I)

How do we learn the parameters? (EM, ML estimation)

Let's look at ML estimation:

Let C = WW^T + \sigma^2 I

We want to maximize \ell(\theta; X) = \sum_i \log P(x_i \mid \theta)


Probabilistic PCA: ML Estimation

\ell(\theta; X) = \sum_i \log P(x_i \mid \theta)

= -\frac{N}{2} \log |C| - \frac{1}{2} \sum_i (x_i - b)^T C^{-1} (x_i - b) + \text{const}

= -\frac{N}{2} \log |C| - \frac{1}{2} \mathrm{Tr}\Big[ C^{-1} \sum_i (x_i - b)(x_i - b)^T \Big] + \text{const}

= -\frac{N}{2} \log |C| - \frac{1}{2} \mathrm{Tr}\big[ C^{-1} S \big] + \text{const}

where S = \sum_i (x_i - b)(x_i - b)^T is the scatter matrix and the constant does not depend on \theta

Now fit the parameters \theta = (W, b, \sigma) to maximize the log-likelihood

Can also use EM
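For concreteness, a short sketch (not from the slides) that evaluates this log-likelihood directly, with S the scatter matrix around b and the 2π constant kept explicit; it could be handed to a generic optimizer over (W, b, σ).

    import numpy as np

    def ppca_log_likelihood(X, W, b, sigma):
        """ell(theta; X) = sum_i log N(x_i | b, C) with C = W W^T + sigma^2 I."""
        N, d = X.shape
        C = W @ W.T + sigma**2 * np.eye(d)
        Xc = X - b
        S = Xc.T @ Xc                              # sum_i (x_i - b)(x_i - b)^T
        _, logdetC = np.linalg.slogdet(C)
        return (-0.5 * N * d * np.log(2 * np.pi)
                - 0.5 * N * logdetC
                - 0.5 * np.trace(np.linalg.solve(C, S)))

    # tiny usage example with random data and arbitrary parameters
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    print(ppca_log_likelihood(X, rng.normal(size=(4, 2)), np.zeros(4), 1.0))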


Factor Analysis

Fix the latent factor prior to be the unit Gaussian as before:

h \sim \mathcal{N}(h; 0, I)

Noise is sampled from a Gaussian with a diagonal covariance:

\Psi = \mathrm{diag}([\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2])

Still consider a linear relationship between the latent factors and the observed variables; the marginal is P(x) = \mathcal{N}(x; b, WW^T + \Psi)


Factor Analysis

On deriving the posterior P(h \mid x) = \mathcal{N}(h \mid \mu, \Lambda), we get:

\mu = W^T (WW^T + \Psi)^{-1} (x - b)

\Lambda = I - W^T (WW^T + \Psi)^{-1} W

The parameters are coupled, which makes ML estimation difficult

Need to employ EM (or non-linear optimization)
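These two formulas transcribe directly into NumPy; the sketch below (dimensions and parameter values are placeholders) computes the posterior mean and covariance for a given x.

    import numpy as np

    def fa_posterior(x, W, b, psi):
        """P(h | x) = N(h | mu, Lambda); psi holds the diagonal of Psi."""
        M = W @ W.T + np.diag(psi)                 # W W^T + Psi
        A = W.T @ np.linalg.inv(M)                 # W^T (W W^T + Psi)^{-1}
        mu = A @ (x - b)
        Lam = np.eye(W.shape[1]) - A @ W           # I - W^T (W W^T + Psi)^{-1} W
        return mu, Lam

    rng = np.random.default_rng(0)
    W, b, psi = rng.normal(size=(5, 2)), np.zeros(5), np.full(5, 0.3)
    mu, Lam = fa_posterior(rng.normal(size=5), W, b, psi)
    print(mu, np.diag(Lam))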


More General Models

Suppose P(h) cannot be assumed to have a nice Gaussian form

The decoding of the input from the latent states can be a complicated non-linear function

Estimation and inference can get complicated!


Earlier we had:

[Figure: directed graphical model with latent units h1, h2, h3 and observed units x1, ..., x5]

Quick Review

Generative models can be represented as directed graphical models

The nodes represent random variables and the arcs indicate dependency

Some of the random variables are observed, others are hidden


Sigmoid Belief Networks

[Figure: layered directed model with hidden layers h3, h2, h1 above the observed layer x]

Just like a feedforward network, but with the arrows reversed.

Sigmoid Belief Networks

Let x = h^0. Consider binary activations; then:

P(h_i^k = 1 \mid h^{k+1}) = \mathrm{sigm}\Big(b_i^k + \sum_j W_{i,j}^{k+1} h_j^{k+1}\Big)

The joint probability factorizes as:

P(x, h^1, \ldots, h^l) = P(h^l) \Big( \prod_{k=1}^{l-1} P(h^k \mid h^{k+1}) \Big) P(x \mid h^1)

Marginalization yields P(x), which is intractable in practice except for very small models
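This factorization is exactly what ancestral sampling uses: sample the top layer from its prior, then each layer given the one above it, ending with x. A minimal sketch with binary units (layer sizes and parameters are invented for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    sizes = [6, 5, 4, 3]                      # [x, h1, h2, h3]; arbitrary
    Ws = [rng.normal(scale=0.5, size=(sizes[k], sizes[k + 1])) for k in range(3)]
    bs = [rng.normal(scale=0.1, size=sizes[k]) for k in range(3)]
    top_p = np.full(sizes[-1], 0.5)           # factorial Bernoulli prior on h^3

    h = (rng.random(sizes[-1]) < top_p).astype(float)      # h^3 ~ P(h^3)
    for k in reversed(range(3)):                           # sample h^2, h^1, then x
        p = sigm(bs[k] + Ws[k] @ h)                        # P(h^k_i = 1 | h^{k+1})
        h = (rng.random(sizes[k]) < p).astype(float)
    x = h
    print(x)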


Sigmoid Belief Networks

P(x, h^1, \ldots, h^l) = P(h^l) \Big( \prod_{k=1}^{l-1} P(h^k \mid h^{k+1}) \Big) P(x \mid h^1)

The top-level prior is chosen to be factorizable: P(h^l) = \prod_i P(h_i^l)

A single (Bernoulli) parameter is needed for each h_i^l in the case of binary units

Deep Belief Networks are like Sigmoid Belief Networks except for the top two layers


Sigmoid Belief Networks

In the general case, such models are called Helmholtz Machines

Two key references:

• G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal: The Wake-Sleep Algorithm for Unsupervised Neural Networks, Science, 1995

• R. M. Neal: Connectionist Learning of Belief Networks, Artificial Intelligence, 1992


Deep Belief Networks

[Figure: layered model with hidden layers h3, h2, h1 above the observed layer x]

The top two layers now have undirected edges

Deep Belief Networks

The joint probability changes to:

P(x, h^1, \ldots, h^l) = P(h^l, h^{l-1}) \Big( \prod_{k=1}^{l-2} P(h^k \mid h^{k+1}) \Big) P(x \mid h^1)


Deep Belief Networks

[Figure: the top two layers of the network, connected by undirected edges]

The top two layers form a Restricted Boltzmann Machine

An RBM over layers h^{l-1} and h^l has the joint distribution:

P(h^{l-1}, h^l) \propto \exp\big( b^T h^{l-1} + c^T h^l + (h^l)^T W h^{l-1} \big)

We will return to RBMs and training procedures in a while, but first we look at the mathematical machinery that will make our task easier


Energy Based Models

Energy-based models assign a scalar energy to every configuration of the variables under consideration

Learning: change the energy function so that its final shape has some desirable properties

We can define a probability distribution through an energy:

P(x) = \frac{\exp(-\mathrm{Energy}(x))}{Z}

Energies are in the log-probability domain:

\mathrm{Energy}(x) = \log \frac{1}{Z\,P(x)}
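For intuition (not from the slides), a tiny discrete example in which the partition function can be summed exactly, so P(x) = exp(−Energy(x))/Z can be evaluated directly; the quadratic energy over three binary variables is an arbitrary choice.

    import numpy as np
    from itertools import product

    def energy(x):
        # arbitrary energy over 3 binary variables, chosen only for illustration
        x = np.asarray(x, dtype=float)
        J = np.array([[0.0, 1.0, -1.0], [1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]])
        return -x @ J @ x

    configs = list(product([0, 1], repeat=3))
    Z = sum(np.exp(-energy(x)) for x in configs)           # partition function
    probs = {x: np.exp(-energy(x)) / Z for x in configs}
    print(sum(probs.values()))                             # sums to 1
    print(probs[(1, 1, 0)])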


Energy Based Models

P(x) = \frac{\exp(-\mathrm{Energy}(x))}{Z}

Z is a normalizing factor called the partition function:

Z = \sum_x \exp(-\mathrm{Energy}(x))

How do we specify the energy function?


Product of Experts Formulation

In this formulation, the energy function is:

\mathrm{Energy}(x) = \sum_i f_i(x)

Therefore:

P(x) = \frac{\exp\big(-\sum_i f_i(x)\big)}{Z}

We have a product of experts:

P(x) \propto \prod_i P_i(x) \propto \prod_i \exp(-f_i(x))


Product of Experts Formulation

P(x) \propto \prod_i P_i(x) \propto \prod_i \exp(-f_i(x))

Every expert f_i can be seen as enforcing a constraint on x

If f_i is large \Rightarrow P_i(x) is small, i.e. the expert thinks x is implausible (constraint violated)

If f_i is small \Rightarrow P_i(x) is large, i.e. the expert thinks x is plausible (constraint satisfied)

Contrast this with mixture models


Latent Variables

x is observed; let's say h are hidden factors that explain x

The probability then becomes:

P(x, h) = \frac{\exp(-\mathrm{Energy}(x, h))}{Z}

We only care about the marginal:

P(x) = \sum_h \frac{\exp(-\mathrm{Energy}(x, h))}{Z}


Latent Variables

P(x) = \sum_h \frac{\exp(-\mathrm{Energy}(x, h))}{Z}

We introduce another term, by analogy with statistical physics: the free energy:

P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}

The free energy is just a marginalization of energies in the log-domain:

\mathrm{FreeEnergy}(x) = -\log \sum_h \exp(-\mathrm{Energy}(x, h))
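To make the definition concrete, a brute-force sketch (an illustration only; the bilinear energy is a placeholder) that computes the free energy by summing over all binary configurations of h, which is feasible only when h is very small:

    import numpy as np
    from itertools import product

    def free_energy(x, energy, n_hidden):
        """FreeEnergy(x) = -log sum_h exp(-Energy(x, h)), h over {0,1}^n_hidden."""
        hs = [np.array(h, dtype=float) for h in product([0, 1], repeat=n_hidden)]
        vals = np.array([-energy(x, h) for h in hs])
        m = vals.max()
        return -(m + np.log(np.exp(vals - m).sum()))       # stable log-sum-exp

    W = np.array([[1.0, -0.5], [0.3, 0.2], [-0.7, 0.4]])   # placeholder parameters
    energy = lambda x, h: -(x @ W @ h)                     # placeholder joint energy
    print(free_energy(np.array([1.0, 0.0, 1.0]), energy, n_hidden=2))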


Latent Variables

P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}

Likewise, the partition function:

Z = \sum_x \exp(-\mathrm{FreeEnergy}(x))

We have an expression for P(x) (and hence for the data log-likelihood). Let us see what the gradient looks like


Data Log-Likelihood Gradient

P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}

Working from the above, the gradient is:

\frac{\partial \log P(x)}{\partial \theta} = -\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} + \frac{1}{Z} \sum_{\tilde{x}} \exp\big(-\mathrm{FreeEnergy}(\tilde{x})\big)\, \frac{\partial\,\mathrm{FreeEnergy}(\tilde{x})}{\partial \theta}

Note that P(\tilde{x}) = \exp(-\mathrm{FreeEnergy}(\tilde{x}))/Z, so the second term is an expectation under the model distribution


Data Log-Likelihood Gradient

The expected log-likelihood gradient over the training set then has the following form:

\mathbb{E}_{\tilde{P}}\left[ \frac{\partial \log P(x)}{\partial \theta} \right] = -\mathbb{E}_{\tilde{P}}\left[ \frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} \right] + \mathbb{E}_{P}\left[ \frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} \right]

\tilde{P} is the empirical training distribution

Easy to compute!


Data Log-Likelihood Gradient

\mathbb{E}_{\tilde{P}}\left[ \frac{\partial \log P(x)}{\partial \theta} \right] = -\mathbb{E}_{\tilde{P}}\left[ \frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} \right] + \mathbb{E}_{P}\left[ \frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} \right]

P is the model distribution (exponentially many configurations!)

Usually very hard to compute!

Resort to Markov Chain Monte Carlo to get a stochastic estimator of the gradient
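Schematically (a sketch, not the course's code), the two expectations are estimated with a minibatch of training points and a batch of model samples, e.g. obtained by MCMC; the linear free energy and the "model samples" below are stand-ins just to make the estimator concrete.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy free energy F_theta(x) = -theta . x, so dF/dtheta = -x (placeholder model)
    def grad_free_energy(x):
        return -x

    data_batch = rng.normal(loc=1.0, size=(64, 3))    # x ~ P_tilde (training minibatch)
    model_batch = rng.normal(loc=0.0, size=(64, 3))   # x ~ P (stand-in for MCMC samples)

    # E_Ptilde[d log P / d theta] ~ -mean over data + mean over model samples
    grad_estimate = (-grad_free_energy(data_batch).mean(axis=0)
                     + grad_free_energy(model_batch).mean(axis=0))
    print(grad_estimate)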


A Special Case

Suppose the energy has the following form:

\mathrm{Energy}(x, h) = -\beta(x) + \sum_i \gamma_i(x, h_i)

Then the free energy, and hence the numerator of the likelihood, can be computed tractably!

What is P(x)?

What is FreeEnergy(x)?


A Special Case

The likelihood term:

P(x) = \frac{\exp(\beta(x))}{Z} \prod_i \sum_{h_i} \exp(-\gamma_i(x, h_i))

The free energy term:

\mathrm{FreeEnergy}(x) = -\log P(x) - \log Z

= -\beta(x) - \sum_i \log \sum_{h_i} \exp(-\gamma_i(x, h_i))


Restricted Boltzmann Machines

[Figure: an RBM, a bipartite undirected graph between a layer of visible units and a layer of hidden units]

Recall the form of the energy:

\mathrm{Energy}(x, h) = -b^T x - c^T h - h^T W x

This takes the earlier nice form with \beta(x) = b^T x and \gamma_i(x, h_i) = -h_i (c_i + W_i x)

Originally proposed by Smolensky (1987), who called them Harmoniums; a special case of Boltzmann Machines
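This energy transcribes directly into NumPy; a sketch with arbitrary shapes and random parameters:

    import numpy as np

    def rbm_energy(x, h, W, b, c):
        """Energy(x, h) = -b^T x - c^T h - h^T W x."""
        return -b @ x - c @ h - h @ W @ x

    rng = np.random.default_rng(0)
    d, m = 6, 4
    W, b, c = rng.normal(size=(m, d)), rng.normal(size=d), rng.normal(size=m)
    x, h = rng.integers(0, 2, size=d), rng.integers(0, 2, size=m)
    print(rbm_energy(x, h, W, b, c))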


Restricted Boltzmann Machines

As seen before, the free energy can be computed efficiently:

\mathrm{FreeEnergy}(x) = -b^T x - \sum_i \log \sum_{h_i} \exp\big(h_i (c_i + W_i x)\big)

The conditional probability:

P(h \mid x) = \frac{\exp(b^T x + c^T h + h^T W x)}{\sum_{\tilde{h}} \exp(b^T x + c^T \tilde{h} + \tilde{h}^T W x)} = \prod_i P(h_i \mid x)
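With binary h_i ∈ {0, 1}, each inner sum is 1 + exp(c_i + W_i x), so the free energy reduces to a sum of softplus terms. A short sketch (parameter shapes are arbitrary):

    import numpy as np

    def rbm_free_energy(x, W, b, c):
        """FreeEnergy(x) = -b^T x - sum_i log(1 + exp(c_i + W_i x)) for binary h."""
        pre = c + W @ x
        softplus = np.logaddexp(0.0, pre)          # log(1 + exp(pre)), stable
        return -b @ x - softplus.sum()

    rng = np.random.default_rng(0)
    W, b, c = rng.normal(size=(4, 6)), rng.normal(size=6), rng.normal(size=4)
    x = rng.integers(0, 2, size=6).astype(float)
    print(rbm_free_energy(x, W, b, c))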


Restricted Boltzmann Machines

x and h play symmetric roles:

P(x \mid h) = \prod_j P(x_j \mid h)

The common transfer (for the binary case):

P(h_i = 1 \mid x) = \sigma(c_i + W_i x)

P(x_j = 1 \mid h) = \sigma(b_j + W_{:,j}^T h)
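Written out in NumPy (a sketch for the binary case; shapes are arbitrary), the two conditionals are:

    import numpy as np

    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    def p_h_given_x(x, W, c):
        return sigm(c + W @ x)        # P(h_i = 1 | x) = sigma(c_i + W_i x)

    def p_x_given_h(h, W, b):
        return sigm(b + W.T @ h)      # P(x_j = 1 | h) = sigma(b_j + W_{:,j}^T h)

    rng = np.random.default_rng(0)
    W, b, c = rng.normal(size=(4, 6)), rng.normal(size=6), rng.normal(size=4)
    x = rng.integers(0, 2, size=6).astype(float)
    print(p_h_given_x(x, W, c))
    print(p_x_given_h(rng.integers(0, 2, size=4).astype(float), W, b))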


Page 118: Lecture 16 Deep Neural Generative Models - CMSC 35246 ...shubhendu/Pages/Files/Lecture16_pauses.… · Lecture 16 Deep Neural Generative Models CMSC 35246. Representations Figure:

Approximate Learning and Gibbs Sampling

E_P̃ [ ∂ log P(x) / ∂θ ] = −E_P̃ [ ∂ FreeEnergy(x) / ∂θ ] + E_P [ ∂ FreeEnergy(x) / ∂θ ]

We saw the expression for the Free Energy of an RBM, but the second term (the expectation under the model distribution P) was intractable. How do we learn in this case?

Replace the average over all possible input configurations by samples

Run Markov Chain Monte Carlo (Gibbs Sampling):

First sample x^1 ∼ P̃(x), then h^1 ∼ P(h|x^1), then x^2 ∼ P(x|h^1), then h^2 ∼ P(h|x^2), and so on until x^(k+1)
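A sketch of that alternating chain (hedged: illustrative code that reuses the p_h_given_x / p_x_given_h helpers above; not the lecture's implementation):

    def gibbs_chain(x0, b, c, W, k, rng):
        """Run k steps of alternating Gibbs sampling starting from a data point x0.
        Returns the final visible sample x^(k+1), used as a 'negative' model sample."""
        x = x0.astype(float)
        for _ in range(k):
            h = (rng.random(c.shape) < p_h_given_x(x, c, W)).astype(float)  # h ~ P(h|x)
            x = (rng.random(b.shape) < p_x_given_h(h, b, W)).astype(float)  # x ~ P(x|h)
        return x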


Approximate Learning, Alternating Gibbs Sampling

We have already seen: P(x|h) = Π_i P(x_i|h) and P(h|x) = Π_i P(h_i|x)

With: P(h_i = 1|x) = σ(c_i + W_i x) and P(x_j = 1|h) = σ(b_j + W_{:,j}^T h)

Because both conditionals factorize, each layer can be resampled in one parallel block, alternating between h and x


Training a RBM: The Contrastive Divergence Algorithm

Start with a training example on the visible units

Update all the hidden units in parallel

Update all the visible units in parallel to obtain a reconstruction

Update all the hidden units again

Update model parameters (a CD-1 sketch is given below)

Aside: Easy to extend RBMs (and contrastive divergence) to the continuous case
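Putting the steps together, a hedged CD-1 sketch (illustrative NumPy code that reuses the p_h_given_x / p_x_given_h helpers above; the learning rate and the use of probabilities in the negative phase are assumptions, not the lecture's exact recipe):

    import numpy as np

    def cd1_update(x_data, b, c, W, lr, rng):
        """One contrastive-divergence (CD-1) update from a single training example, in place."""
        # Positive phase: hidden probabilities driven by the data, then a binary sample
        ph_data = p_h_given_x(x_data, c, W)
        h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
        # Negative phase: reconstruct the visibles once, then recompute hidden probabilities
        px_recon = p_x_given_h(h_sample, b, W)
        ph_recon = p_h_given_x(px_recon, c, W)
        # Parameter updates: <h x^T>_data - <h x^T>_reconstruction, and the bias analogues
        W += lr * (np.outer(ph_data, x_data) - np.outer(ph_recon, px_recon))
        b += lr * (x_data - px_recon)
        c += lr * (ph_data - ph_recon)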


Boltzmann Machines

A model in which the energy has the form:

Energy(x, h) = −b^T x − c^T h − h^T W x − x^T U x − h^T V h

Originally proposed by Hinton and Sejnowski (1983)

Important historically, but very difficult to train (why? the visible-visible and hidden-hidden couplings U and V mean the conditionals P(h|x) and P(x|h) no longer factorize, so inference and sampling become much more expensive)
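A minimal sketch of the fuller energy (illustrative, extending the rbm_energy sketch above; taking U and V to be symmetric with zero diagonal is an assumption here):

    def boltzmann_energy(x, h, b, c, W, U, V):
        """General Boltzmann Machine energy: the RBM terms plus visible-visible (U)
        and hidden-hidden (V) quadratic couplings."""
        return (-(b @ x) - (c @ h) - (h @ W @ x)
                - (x @ U @ x) - (h @ V @ h))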


Back to Deep Belief Networks

[Figure: DBN with visible layer x and hidden layers h^1, h^2, h^3]

P(x, h^1, . . . , h^l) = P(h^l, h^(l−1)) ( Π_{k=1}^{l−2} P(h^k | h^(k+1)) ) P(x | h^1)
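To make the factorization concrete, a hedged ancestral-sampling sketch (illustrative code reusing the sigmoid / p_h_given_x / p_x_given_h helpers above; the gen_W / gen_b parameter layout is an assumption, not the lecture's notation):

    def sample_from_dbn(top_b, top_c, top_W, gen_W, gen_b, n_gibbs, rng):
        """Ancestral sample from P(x, h^1, ..., h^l):
        Gibbs sample the top-level RBM P(h^l, h^(l-1)), then sample each lower layer
        from the directed conditionals P(h^k | h^(k+1)) down to the visibles x.

        gen_W[k], gen_b[k] parameterize P(layer_k | layer_(k+1)) with layer_0 = x,
        so gen_W[k] has shape (dim_k, dim_(k+1)); n_gibbs >= 1."""
        # 1) Run a Gibbs chain in the undirected top-level RBM over (h^(l-1), h^l)
        h_upper = (rng.random(top_c.shape) < 0.5).astype(float)
        for _ in range(n_gibbs):
            h_lower = (rng.random(top_b.shape) < p_x_given_h(h_upper, top_b, top_W)).astype(float)
            h_upper = (rng.random(top_c.shape) < p_h_given_x(h_lower, top_c, top_W)).astype(float)
        # 2) Directed pass downwards: h^(l-2), ..., h^1, then x
        sample = h_lower
        for W_k, b_k in zip(reversed(gen_W), reversed(gen_b)):
            sample = (rng.random(b_k.shape) < sigmoid(b_k + W_k @ sample)).astype(float)
        return sample  # a visible configuration x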


Greedy Layer-wise Training of DBNs

Reference: G. E. Hinton, S. Osindero and Y. W. Teh: A Fast Learning Algorithm for Deep Belief Networks, Neural Computation, 2006.

First Step: Construct an RBM with input x and a hidden layer h, train the RBM

Stack another layer on top of the RBM to form a new RBM. Fix W^1, sample from P(h^1|x), train W^2 as an RBM

Continue till k layers (a stacking sketch is given below)

Implicitly defines P(x) and P(h) (a variational bound justifies layerwise training)

Can then be discriminatively fine-tuned using backpropagation
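A hedged outline of the greedy stacking loop (illustrative code reusing cd1_update and p_h_given_x from the sketches above; using the hidden probabilities as the next layer's input, the epoch count, and the initialization scale are assumptions):

    import numpy as np

    def greedy_pretrain(data, layer_sizes, n_epochs, lr, rng):
        """Greedy layer-wise pretraining: train an RBM on the current representations,
        freeze it, map the data through P(h|x), and train the next RBM on top."""
        reps = [x.astype(float) for x in data]          # current layer's "visible" data
        params = []
        n_in = reps[0].shape[0]
        for n_hid in layer_sizes:
            b, c = np.zeros(n_in), np.zeros(n_hid)
            W = 0.01 * rng.normal(size=(n_hid, n_in))
            for _ in range(n_epochs):                   # train this RBM with CD-1
                for x in reps:
                    cd1_update(x, b, c, W, lr, rng)
            params.append((b, c, W))
            reps = [p_h_given_x(x, c, W) for x in reps] # propagate data to the next layer
            n_in = n_hid
        return params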


Deep Autoencoders (2006)

G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006

From last time: it was hard to train deep networks from scratch in 2006!


Semantic Hashing

G. Hinton and R. Salakhutdinov, "Semantic Hashing", 2006


Why does Unsupervised Pre-training work?

Regularization: feature representations that are good for P(x) are good for P(y|x)

Optimization: unsupervised pre-training leads to better regions of the parameter space, i.e. better than random initialization


Effect of Unsupervised Pre-training


Important topics we didn't talk about in detail/at all:

• Joint unsupervised training of all layers (Wake-Sleep algorithm)

• Deep Boltzmann Machines

• Variational bounds justifying greedy layerwise training

• Conditional RBMs, Multimodal RBMs, Temporal RBMs, etc.


Next Time

Some Applications of methods we just considered

Generative Adversarial Networks
