TRANSCRIPT
CPEG 589 – Advanced Deep Learning
Lecture 5
Outline
GAN – How do we find the latent representation for a given image?
Generative Models
Generative Adversarial Networks - GANs
Variational Autoencoders – VAEs
Data Generation Approaches in GANs and the Difficulty of Obtaining Variations of New Data
The VAE Approach to Generating Data
Theory of VAEs
Variations of VAEs
VAEs vs GANs
GAN – Data Generation
• Difficulty with GANs – given a new real image (or other data point), how do we create variations of it?
GAN – Determining Latent Representation
Train another network to learn the reverse mapping (from image to z).
One possibility is to use the Projected Gradient Descent (PGD) technique, as in the sketch below.
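A minimal sketch of this idea via plain gradient descent on the latent code; the trained generator G is a stand-in for any fixed GAN generator, and a full PGD variant would additionally project z back onto a feasible set after each step:

import torch

def find_latent(G, x_target, z_dim=100, steps=500, lr=0.01):
    # start from a random latent code and optimize it to reproduce x_target
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - x_target) ** 2).mean()  # reconstruction error in image space
        loss.backward()                         # gradient flows into z; G stays frozen
        opt.step()
    return z.detach()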
Background – Autoencoders
Autoencoders learn a compressed latent space, and how to recover the original data from it, through unsupervised training.
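A minimal sketch of such an autoencoder for flattened 28×28 inputs; the layer sizes here are illustrative assumptions:

import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # encoder: compress the input into a z_dim-dimensional latent code
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        # decoder: reconstruct the input from the latent code
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

Training minimizes a reconstruction loss (e.g. MSE or binary cross-entropy between input and output); no labels are needed, which is why the training is unsupervised.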
AutoEncoding MNIST
Data Generation Via AutoEncoder
VAE
[Figure: VAE architecture; "Sample" denotes the sampling step between encoder and decoder]
VAE Concept
X = input
P(X) = distribution associated with X
Z = latent/hidden variable = encoded output
P(Z) = target latent distribution
Goal: make the latent space follow a known continuous distribution (e.g. a normal distribution), so that we can sample from this distribution and generate artificial data.
[Figure: X ~ P(X) → Encoder Q(Z|X) → Z ~ P(Z) → Decoder P(X|Z) → X]
VAE Theory
Suppose we approximate the distribution P(Z|X) with some distribution Q(Z|X).
Let us minimize the KL divergence between them:

D_{KL}(Q(Z|X) \| P(Z|X)) = \sum_Z Q(Z|X) \log \frac{Q(Z|X)}{P(Z|X)}
= E\left[\log \frac{Q(Z|X)}{P(Z|X)}\right] = E[\log Q(Z|X) - \log P(Z|X)]

Using Bayes' rule: P(Z|X) = \frac{P(X|Z)\, P(Z)}{P(X)}

D_{KL}(Q(Z|X) \| P(Z|X)) = E\left[\log Q(Z|X) - \log \frac{P(X|Z)\, P(Z)}{P(X)}\right]
= E[\log Q(Z|X) - \log P(X|Z) - \log P(Z) + \log P(X)]
VAE Theory
D_{KL}(Q(Z|X) \| P(Z|X)) = E_Z[\log Q(Z|X) - \log P(X|Z) - \log P(Z) + \log P(X)]

Since the expectation is over Z, \log P(X) can be separated out:

D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log Q(Z|X) - \log P(X|Z) - \log P(Z)] + \log P(X)
D_{KL}(Q(Z|X) \| P(Z|X)) - \log P(X) = E[\log Q(Z|X) - \log P(X|Z) - \log P(Z)]
\log P(X) - D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log P(X|Z) - (\log Q(Z|X) - \log P(Z))]
= E[\log P(X|Z)] - E[\log Q(Z|X) - \log P(Z)]

The second expectation is exactly D_{KL}(Q(Z|X) \| P(Z)).
VAE Theory
\log P(X) - D_{KL}(Q(Z|X) \| P(Z|X)) = E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))

Here \log P(X) is constant for a given X, E[\log P(X|Z)] is the reconstruction term, and D_{KL}(Q(Z|X) \| P(Z)) forces the encoded distribution toward the chosen prior.

Rearranging:

D_{KL}(Q(Z|X) \| P(Z|X)) = \log P(X) - E[\log P(X|Z)] + D_{KL}(Q(Z|X) \| P(Z))

Since D_{KL} is always non-negative, we can conclude that:

\log P(X) \ge E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))

The right-hand side is the Evidence Lower BOund (ELBO).
VAE Theory
ELBO: \log P(X) \ge E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))
How do we maximize \log P(X)?
To easily sample P(Z) and generate new data, we let P(Z) be a standard normal distribution, i.e. N(0, I).
Let Q(Z|X) be Gaussian with parameters \mu(x) and \Sigma(x).
The KL divergence between Q(Z|X) and P(Z) can then be computed in closed form, as follows…
VAE Implementation
D_{KL}(N(\mu(x), \Sigma(x)) \| N(0, I)) = \frac{1}{2}\left[\mathrm{tr}(\Sigma(x)) + \mu(x)^T \mu(x) - k - \log\det(\Sigma(x))\right]

For a diagonal \Sigma(x) with k latent dimensions, this becomes an elementwise sum:

D_{KL}(N(\mu(x), \Sigma(x)) \| N(0, I)) = \frac{1}{2}\left[\sum_k \Sigma_k(x) + \sum_k \mu_k^2(x) - \sum_k 1 - \log\prod_k \Sigma_k(x)\right]
= \frac{1}{2}\left[\sum_k \Sigma_k(x) + \sum_k \mu_k^2(x) - \sum_k 1 - \sum_k \log \Sigma_k(x)\right]
= \frac{1}{2}\sum_k\left[\Sigma_k(x) + \mu_k^2(x) - 1 - \log \Sigma_k(x)\right]

Sometimes we replace \Sigma(x) with e^{\Sigma(x)}, i.e. the encoder outputs the log-variance:

D_{KL}(N(\mu(x), e^{\Sigma(x)}) \| N(0, I)) = \frac{1}{2}\sum_k\left[\exp(\Sigma_k(x)) + \mu_k^2(x) - 1 - \Sigma_k(x)\right]

The above will become the KL loss term in the implementation.
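In PyTorch, with mu and log_var as the encoder outputs, this final form is one line (it reappears in the loss_function later in these slides):

# closed-form KL term, summed over latent dimensions and over the batch
KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())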
Reparameterization Trick
Because sampling is involved, backpropagation must pass through a random node.
[Figure: VAE network; "Sample" denotes the random sampling node]
VAE Loss Function
ELBO = E[\log P(X|Z)] - D_{KL}(Q(Z|X) \| P(Z))
Loss in VAE = D_{KL}(Q(Z|X) \| P(Z)) - E[\log P(X|Z)]
Training requires the gradient of this loss with respect to the parameters θ.
How do we take the gradient of an expectation?
Gradient of Expectation
What if z follows a distribution whose parameters depend on θ?
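One standard tool here, stated as a reminder of the textbook identity, is the log-derivative (score-function) trick:

\nabla_\theta\, E_{z \sim q_\theta(z)}[f(z)] = E_{z \sim q_\theta(z)}\left[f(z)\, \nabla_\theta \log q_\theta(z)\right]

This estimator is unbiased but typically has high variance, which motivates the reparameterization trick on the following slides.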
VAE – Gradient of Expectation
Reparameterization Trick
Two approaches to computing the gradient of the VAE loss:
1. Use the log-derivative (score-function) estimator shown above, which tends to have high variance.
2. Use the reparameterization trick: introduce a new random variable ϵ that allows us to reparameterize z so that backpropagation flows only through deterministic nodes.
Reparameterization Trick
General idea: instead of sampling z ~ N(μ(x), Σ(x)) directly, sample ϵ ~ N(0, I) and set z = μ(x) + Σ(x)^{1/2} ⊙ ϵ. The randomness enters as an input node, so gradients can flow through μ(x) and Σ(x) deterministically.
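A minimal PyTorch sketch of this idea, mirroring the reparameter_sampling method in the implementation below:

import torch

mu = torch.zeros(4, requires_grad=True)       # stands in for μ(x)
log_var = torch.zeros(4, requires_grad=True)  # stands in for Σ(x) as a log-variance

eps = torch.randn(4)                      # ϵ ~ N(0, I): the only random input
z = mu + torch.exp(0.5 * log_var) * eps   # z = μ + σ·ϵ, deterministic in μ and Σ

loss = (z ** 2).sum()
loss.backward()                           # gradients reach mu and log_var
print(mu.grad, log_var.grad)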
VAE Implementation - PyTorch
# imports assumed by this and the following snippets
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

class Utils(object):
    def PrepareData(self):
        batch_size = 100
        # get dataset
        train_dataset = datasets.MNIST(root='./mnist_data/', train=True,
                                       transform=transforms.ToTensor(), download=True)
        test_dataset = datasets.MNIST(root='./mnist_data/', train=False,
                                      transform=transforms.ToTensor(), download=True)
        # data loaders
        train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                   batch_size=batch_size, shuffle=True)
        test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                                  batch_size=batch_size, shuffle=False)
        return train_loader, test_loader
VAE Implementation

class VAEModel(nn.Module):
    def __init__(self, x_dim, h_dim1, h_dim2, z_dim):
        super(VAEModel, self).__init__()
        # ----encoder components
        self.fce1 = nn.Linear(x_dim, h_dim1)
        self.fce2 = nn.Linear(h_dim1, h_dim2)
        self.fcMu = nn.Linear(h_dim2, z_dim)
        self.fcSigma = nn.Linear(h_dim2, z_dim)
        # ----decoder components
        self.fcd1 = nn.Linear(z_dim, h_dim2)
        self.fcd2 = nn.Linear(h_dim2, h_dim1)
        self.fcdout = nn.Linear(h_dim1, x_dim)

    def encoder(self, x):
        h = F.relu(self.fce1(x))
        h = F.relu(self.fce2(h))
        return self.fcMu(h), self.fcSigma(h)  # mu, log_var

    def reparameter_sampling(self, mu, log_var):
        std = torch.exp(0.5 * log_var)   # log-variance -> standard deviation
        eps = torch.randn_like(std)      # ϵ ~ N(0, I)
        return eps.mul(std).add_(mu)     # return z sample: z = μ + σ·ϵ

    def decoder(self, z):
        h = F.relu(self.fcd1(z))
        h = F.relu(self.fcd2(h))
        return torch.sigmoid(self.fcdout(h))

    def forward(self, x):
        mu, log_var = self.encoder(x.view(-1, 784))
        z = self.reparameter_sampling(mu, log_var)
        out = self.decoder(z)
        return out, mu, log_var
VAE Implementation

def main():
    ngpu = 1
    device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
    # prepare data loaders
    utils = Utils()
    train_loader, test_loader = utils.PrepareData()
    # build model
    vaemodel = VAEModel(x_dim=784, h_dim1=512, h_dim2=256, z_dim=2)
    if torch.cuda.is_available():
        vaemodel.cuda()
    epochs = 10
    optimizer = optim.Adam(vaemodel.parameters())
    train(epochs, vaemodel, train_loader, test_loader, optimizer)
    with torch.no_grad():
        z = torch.randn(64, 2).cuda()     # random latent samples from N(0, I)
        gen_sample = vaemodel.decoder(z)  # decode them into generated images
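To inspect the generated digits, one option (an illustrative addition, using the same torchvision package as the data-loading code) is to save them as an image grid:

from torchvision.utils import save_image
# reshape the 64 decoded vectors back into 1×28×28 images and tile them 8×8
save_image(gen_sample.view(64, 1, 28, 28).cpu(), './samples.png', nrow=8)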
VAE Implementation

def loss_function(recon_x, x, mu, log_var):
    # reconstruction term: -E[log P(X|Z)] for a Bernoulli decoder
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    # closed-form KL term derived earlier
    KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return BCE + KLD

def train(epochs, vaemodel, train_loader, test_loader, optimizer):
    vaemodel.train()  # set it in train mode
    train_loss = 0
    for i in range(epochs):
        for batch_idx, (data, _) in enumerate(train_loader):
            data = data.cuda()
            optimizer.zero_grad()  # clear gradients
            recon_batch, mu, log_var = vaemodel(data)
            loss = loss_function(recon_batch, data, mu, log_var)
            loss.backward()
            train_loss += loss.item()
            optimizer.step()
Results – VAE Simple
Variations in VAEs
Loss in VAE = D_{KL}(Q(Z|X) \| P(Z)) - E[\log P(X|Z)]
Problem:
The encoder may learn just P(Z), independent of x.
The decoder may just memorize the training data and generate x independent of z.
One remedy is to weight the D_{KL} term, but this actually makes the encoding worse.
Solution: use MMD instead of D_{KL}.
MMD (Maximum Mean Discrepancy) relies on the fact that two distributions are equal if and only if all of their moments are the same; a sketch follows.
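A minimal sketch of an MMD penalty with a Gaussian (RBF) kernel, in the style of MMD-VAEs; the function names and the bandwidth choice here are illustrative assumptions, not the lecture's code:

def gaussian_kernel(a, b):
    # pairwise RBF kernel between the rows of a (n×d) and b (m×d)
    diff = a.unsqueeze(1) - b.unsqueeze(0)              # n×m×d differences
    return torch.exp(-diff.pow(2).sum(-1) / a.size(1))  # bandwidth = latent dim

def mmd_loss(z):
    # compare encoded samples z against samples from the prior N(0, I)
    prior_z = torch.randn_like(z)
    return (gaussian_kernel(prior_z, prior_z).mean()
            + gaussian_kernel(z, z).mean()
            - 2 * gaussian_kernel(prior_z, z).mean())

This term would replace the D_{KL} term in the loss; the reconstruction term stays the same.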
VAE – Data Generation