TRANSCRIPT
CPEG 589 – Advanced Deep Learning
Lecture 5
Outline
GAN – How do we find the latent representation for a given image?
Generative Models
Generative Adversarial Networks - GANs
Variational Autoencoders – VAEs
Data Generation Approaches in GANs and the Difficulty of Obtaining Variations of New Data
The VAE Approach to Generating Data
Theory of VAEs
Variations of VAEs
VAEs vs GANs
GAN – Data Generation
• Difficulty with GANs – given a new real image (or other data point), how do we create variations of it?
GAN – Determining Latent Representation
Train another network to learn the reverse mapping (from image to z).
One possibility is to use the Projected Gradient Descent (PGD) technique, as in the sketch below.
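A minimal sketch of this idea via plain gradient descent on the latent code; the trained generator G is a stand-in for any fixed GAN generator, and a full PGD variant would additionally project z back onto a feasible set after each step:

import torch

def find_latent(G, x_target, z_dim=100, steps=500, lr=0.01):
    # start from a random latent code and optimize it to reproduce x_target
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - x_target) ** 2).mean()  # reconstruction error in image space
        loss.backward()                         # gradient flows into z; G stays frozen
        opt.step()
    return z.detach()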
Background – Autoencoders
Autoencoders learn a compressed latent space, and how to recover the original data from it, through unsupervised training.
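A minimal sketch of such an autoencoder for flattened 28×28 inputs; the layer sizes here are illustrative assumptions:

import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # encoder: compress the input into a z_dim-dimensional latent code
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        # decoder: reconstruct the input from the latent code
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

Training minimizes a reconstruction loss (e.g. MSE or binary cross-entropy between input and output); no labels are needed, which is why the training is unsupervised.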
AutoEncoding MNIST
Data Generation Via AutoEncoder
VAE
[Figure: VAE architecture; "Sample" denotes the sampling step between encoder and decoder]
VAE Concept
X = input
P(X) = distribution associated with X
Z = latent/hidden variable = encoded output
P(Z) = target latent distribution
Goal: make the latent space follow a known continuous distribution (e.g. a normal distribution), so that we can sample from this distribution and generate artificial data.
[Figure: X ~ P(X) → Encoder Q(Z|X) → Z ~ P(Z) → Decoder P(X|Z) → X]
VAE Theory
Suppose we approximate the distribution P(Z|X) with some distribution Q(Z|X).
Let us minimize the KL divergence between them:

D_{KL}(Q(Z|X) \| P(Z|X)) = \sum_Z Q(Z|X) \log \frac{Q(Z|X)}{P(Z|X)}
= E\left[\log \frac{Q(Z|X)}{P(Z|X)}\right] = E[\log Q(Z|X) - \log P(Z|X)]

Using Bayes' rule: P(Z|X) = \frac{P(X|Z)\, P(Z)}{P(X)}

D_{KL}(Q(Z|X) \| P(Z|X)) = E\left[\log Q(Z|X) - \log \frac{P(X|Z)\, P(Z)}{P(X)}\right]
= E[\log Q(Z|X) - \log P(X|Z) - \log P(Z) + \log P(X)]
VAE Theory
D_{KL}(Q(Z|X) \| P(Z|X)) = E_Z[\log Q(Z|X) - \log P(X|Z) - \log P(Z) + \log P(X)]

Since the expectation is over Z, \log P(X) can be separated out:

D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log Q(Z|X) - \log P(X|Z) - \log P(Z)] + \log P(X)
D_{KL}(Q(Z|X) \| P(Z|X)) - \log P(X) = E[\log Q(Z|X) - \log P(X|Z) - \log P(Z)]
\log P(X) - D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log P(X|Z) - (\log Q(Z|X) - \log P(Z))]
= E[\log P(X|Z)] - E[\log Q(Z|X) - \log P(Z)]

The second expectation is exactly D_{KL}(Q(Z|X) \| P(Z)).
VAE Theory
\log P(X) - D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))

Here \log P(X) is constant for a given X, E[\log P(X|Z)] is the reconstruction term, and D_{KL}(Q(Z|X) \| P(Z)) forces the encoded distribution toward the chosen prior.

Rearranging:

D_{KL}(Q(Z|X) \| P(Z|X)) = \log P(X) - E[\log P(X|Z)] + D_{KL}(Q(Z|X) \| P(Z))

Since D_{KL} is always non-negative, we can conclude that:

\log P(X) \ge E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))

The right-hand side is the Evidence Lower BOund (ELBO).
VAE Theory
ELBO: \log P(X) \ge E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))
How do we maximize \log P(X)?
To easily sample P(Z) and generate new data, we let P(Z) be a standard normal distribution, i.e. N(0, I).
Let Q(Z|X) be Gaussian with parameters \mu(x) and \Sigma(x).
The KL divergence between Q(Z|X) and P(Z) can then be computed in closed form, as follows…
VAE Implementation
D_{KL}(N(\mu(x), \Sigma(x)) \| N(0, I)) = \frac{1}{2}\left[\mathrm{tr}(\Sigma(x)) + \mu(x)^T \mu(x) - k - \log\det(\Sigma(x))\right]

For a diagonal \Sigma(x) with k latent dimensions, this becomes an elementwise sum:

D_{KL}(N(\mu(x), \Sigma(x)) \| N(0, I)) = \frac{1}{2}\left[\sum_k \Sigma_k(x) + \sum_k \mu_k^2(x) - \sum_k 1 - \log\prod_k \Sigma_k(x)\right]
= \frac{1}{2}\left[\sum_k \Sigma_k(x) + \sum_k \mu_k^2(x) - \sum_k 1 - \sum_k \log \Sigma_k(x)\right]
= \frac{1}{2}\sum_k\left[\Sigma_k(x) + \mu_k^2(x) - 1 - \log \Sigma_k(x)\right]

Sometimes we replace \Sigma(x) with e^{\Sigma(x)}, i.e. the encoder outputs the log-variance:

D_{KL}(N(\mu(x), e^{\Sigma(x)}) \| N(0, I)) = \frac{1}{2}\sum_k\left[\exp(\Sigma_k(x)) + \mu_k^2(x) - 1 - \Sigma_k(x)\right]

The above will become the KL loss term in the implementation.
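In PyTorch, with mu and log_var as the encoder outputs, this final form is one line (it reappears in the loss_function later in these slides):

# closed-form KL term, summed over latent dimensions and over the batch
KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())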
Reparameterization Trick
Because sampling is involved, backpropagation must pass through a random node.
[Figure: VAE network; "Sample" denotes the random sampling node]
VAE Loss Function
ELBO = E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))
Loss in VAE = D_{KL}(Q(Z|X) \| P(Z)) - E[\log P(X|Z)]
Training requires the gradient of this loss with respect to the parameters θ.
How do we take the gradient of an expectation?
Gradient of Expectation
What if z follows a distribution whose parameters depend on θ?
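One standard tool here, stated as a reminder of the textbook identity, is the log-derivative (score-function) trick:

\nabla_\theta\, E_{z \sim q_\theta(z)}[f(z)] = E_{z \sim q_\theta(z)}\left[f(z)\, \nabla_\theta \log q_\theta(z)\right]

This estimator is unbiased but typically has high variance, which motivates the reparameterization trick on the following slides.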
VAE – Gradient of Expectation
Reparameterization Trick
Two approaches to computing the gradient of the VAE loss:
1. Use the log-derivative (score-function) estimator shown above, which tends to have high variance.
2. Use the reparameterization trick: introduce a new random variable ϵ that allows us to reparameterize z so that backpropagation flows only through deterministic nodes.
Reparameterization Trick
General idea: instead of sampling z ~ N(μ(x), Σ(x)) directly, sample ϵ ~ N(0, I) and set z = μ(x) + Σ(x)^{1/2} ⊙ ϵ. The randomness enters as an input node, so gradients can flow through μ(x) and Σ(x) deterministically.
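A minimal PyTorch sketch of this idea, mirroring the reparameter_sampling method in the implementation below:

import torch

mu = torch.zeros(4, requires_grad=True)       # stands in for μ(x)
log_var = torch.zeros(4, requires_grad=True)  # stands in for Σ(x) as a log-variance

eps = torch.randn(4)                      # ϵ ~ N(0, I): the only random input
z = mu + torch.exp(0.5 * log_var) * eps   # z = μ + σ·ϵ, deterministic in μ and Σ

loss = (z ** 2).sum()
loss.backward()                           # gradients reach mu and log_var
print(mu.grad, log_var.grad)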
VAE Implementation - PyTorch
# imports assumed by this and the following snippets
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

class Utils(object):
    def PrepareData(self):
        batch_size = 100
        # get dataset
        train_dataset = datasets.MNIST(root='./mnist_data/', train=True,
                                       transform=transforms.ToTensor(), download=True)
        test_dataset = datasets.MNIST(root='./mnist_data/', train=False,
                                      transform=transforms.ToTensor(), download=True)
        # data loaders
        train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                   batch_size=batch_size, shuffle=True)
        test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                                  batch_size=batch_size, shuffle=False)
        return train_loader, test_loader
VAE Implementation

class VAEModel(nn.Module):
    def __init__(self, x_dim, h_dim1, h_dim2, z_dim):
        super(VAEModel, self).__init__()
        # ----encoder components
        self.fce1 = nn.Linear(x_dim, h_dim1)
        self.fce2 = nn.Linear(h_dim1, h_dim2)
        self.fcMu = nn.Linear(h_dim2, z_dim)
        self.fcSigma = nn.Linear(h_dim2, z_dim)
        # ----decoder components
        self.fcd1 = nn.Linear(z_dim, h_dim2)
        self.fcd2 = nn.Linear(h_dim2, h_dim1)
        self.fcdout = nn.Linear(h_dim1, x_dim)

    def encoder(self, x):
        h = F.relu(self.fce1(x))
        h = F.relu(self.fce2(h))
        return self.fcMu(h), self.fcSigma(h)  # mu, log_var

    def reparameter_sampling(self, mu, log_var):
        std = torch.exp(0.5 * log_var)   # log-variance -> standard deviation
        eps = torch.randn_like(std)      # ϵ ~ N(0, I)
        return eps.mul(std).add_(mu)     # return z sample: z = μ + σ·ϵ

    def decoder(self, z):
        h = F.relu(self.fcd1(z))
        h = F.relu(self.fcd2(h))
        return torch.sigmoid(self.fcdout(h))

    def forward(self, x):
        mu, log_var = self.encoder(x.view(-1, 784))
        z = self.reparameter_sampling(mu, log_var)
        out = self.decoder(z)
        return out, mu, log_var
VAE Implementation

def main():
    ngpu = 1
    device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
    # prepare data loaders
    utils = Utils()
    train_loader, test_loader = utils.PrepareData()
    # build model
    vaemodel = VAEModel(x_dim=784, h_dim1=512, h_dim2=256, z_dim=2)
    if torch.cuda.is_available():
        vaemodel.cuda()
    epochs = 10
    optimizer = optim.Adam(vaemodel.parameters())
    train(epochs, vaemodel, train_loader, test_loader, optimizer)
    with torch.no_grad():
        z = torch.randn(64, 2).cuda()     # random latent samples from N(0, I)
        gen_sample = vaemodel.decoder(z)  # decode them into generated images
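To inspect the generated digits, one option (an illustrative addition, using the same torchvision package as the data-loading code) is to save them as an image grid:

from torchvision.utils import save_image
# reshape the 64 decoded vectors back into 1×28×28 images and tile them 8×8
save_image(gen_sample.view(64, 1, 28, 28).cpu(), './samples.png', nrow=8)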
VAE Implementation

def loss_function(recon_x, x, mu, log_var):
    # reconstruction term: -E[log P(X|Z)] for a Bernoulli decoder
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    # closed-form KL term derived earlier
    KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return BCE + KLD

def train(epochs, vaemodel, train_loader, test_loader, optimizer):
    vaemodel.train()  # set it in train mode
    train_loss = 0
    for i in range(epochs):
        for batch_idx, (data, _) in enumerate(train_loader):
            data = data.cuda()
            optimizer.zero_grad()  # clear gradients
            recon_batch, mu, log_var = vaemodel(data)
            loss = loss_function(recon_batch, data, mu, log_var)
            loss.backward()
            train_loss += loss.item()
            optimizer.step()
Results – VAE Simple
Variations in VAEs
Loss in VAE = D_{KL}(Q(Z|X) \| P(Z)) - E[\log P(X|Z)]
Problem:
The encoder may learn just P(Z), independent of x.
The decoder may just memorize the training data and generate x independent of z.
One remedy is to weight the D_{KL} term, but this actually makes the encoding worse.
Solution: use MMD instead of D_{KL}.
MMD (Maximum Mean Discrepancy) relies on the fact that two distributions are equal if and only if all of their moments are the same; a sketch follows.
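A minimal sketch of an MMD penalty with a Gaussian (RBF) kernel, in the style of MMD-VAEs; the function names and the bandwidth choice here are illustrative assumptions, not the lecture's code:

def gaussian_kernel(a, b):
    # pairwise RBF kernel between the rows of a (n×d) and b (m×d)
    diff = a.unsqueeze(1) - b.unsqueeze(0)              # n×m×d differences
    return torch.exp(-diff.pow(2).sum(-1) / a.size(1))  # bandwidth = latent dim

def mmd_loss(z):
    # compare encoded samples z against samples from the prior N(0, I)
    prior_z = torch.randn_like(z)
    return (gaussian_kernel(prior_z, prior_z).mean()
            + gaussian_kernel(z, z).mean()
            - 2 * gaussian_kernel(prior_z, z).mean())

This term would replace the D_{KL} term in the loss; the reconstruction term stays the same.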
VAE – Data Generation