TRANSCRIPT
Generative Adversarial Networks
presented by Ian Goodfellow
presentation co-developed with Aaron Courville
Deep Learning Workshop, ICML 2015 --- Ian Goodfellow
In today’s talk…
• “Generative Adversarial Networks,” Goodfellow et al., NIPS 2014
• “Conditional Generative Adversarial Nets,” Mirza and Osindero, NIPS Deep Learning Workshop 2014
• “On Distinguishability Criteria for Estimating Generative Models,” Goodfellow, ICLR Workshop 2015
• “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks,” Denton, Chintala, et al., arXiv 2015
Generative modeling
• Have training examples x ~ pdata(x)
• Want a model that can draw samples: x ~ pmodel(x)
• Where pmodel ≈ pdata
(Slide figure: training examples drawn from pdata(x) beside samples drawn from pmodel(x).)
Why generative models?
• Conditional generative models
  - Speech synthesis: Text ⇒ Speech
  - Machine Translation: French ⇒ English
    • French: Si mon tonton tond ton tonton, ton tonton sera tondu.
    • English: If my uncle shaves your uncle, your uncle will be shaved.
  - Image ⇒ Image segmentation
• Environment simulator
  - Reinforcement learning
  - Planning
• Leverage unlabeled data?
Maximum likelihood: the dominant approach
• ML objective function
$$\theta^* = \max_\theta \frac{1}{m} \sum_{i=1}^{m} \log p\left(x^{(i)}; \theta\right)$$
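A minimal numeric illustration of this objective (my example, not from the slides): fit a Gaussian model p(x; θ) = N(μ, σ²) by maximum likelihood, where the maximizer is known in closed form, and check that no perturbed parameters score better.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # training examples x ~ p_data

def avg_log_likelihood(x, mu, sigma):
    # (1/m) * sum_i log p(x_i; theta) for the Gaussian model
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

mu_mle, sigma_mle = x.mean(), x.std()             # closed-form ML solution
# The ML estimate is the unique maximizer, so perturbations only hurt:
assert avg_log_likelihood(x, mu_mle, sigma_mle) >= avg_log_likelihood(x, mu_mle + 0.3, sigma_mle)
assert avg_log_likelihood(x, mu_mle, sigma_mle) >= avg_log_likelihood(x, mu_mle, sigma_mle * 1.3)
```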
Undirected graphical models
• Flagship undirected graphical model: Deep Boltzmann machines
• Several “hidden layers” h
$$p(h, x) = \frac{1}{Z}\,\tilde{p}(h, x), \qquad \tilde{p}(h, x) = \exp\left(-E(h, x)\right), \qquad Z = \sum_{h, x} \tilde{p}(h, x)$$
(Slide figure: stacked hidden layers h(3), h(2), h(1) above the visible units x.)
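A toy numeric check of these formulas (a tiny fully visible Boltzmann machine rather than a deep one; the weights are made up): for a handful of units we can evaluate p̃(s) = exp(−E(s)) and the partition function Z by brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
W = rng.normal(size=(n, n))
W = (W + W.T) / 2            # symmetric coupling matrix
np.fill_diagonal(W, 0)
b = rng.normal(size=n)

def energy(s):
    # E(s) = -1/2 s^T W s - b^T s  (a standard Boltzmann machine energy)
    return -0.5 * s @ W @ s - b @ s

states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(s)) for s in states)    # partition function: 2^n terms
probs = [np.exp(-energy(s)) / Z for s in states]
assert abs(sum(probs) - 1.0) < 1e-12           # p is a normalized distribution
```

Z has 2^n terms, so this brute-force sum is hopeless for realistic layer sizes; learning must instead rely on MCMC, which runs into the mixing problems described on the next slide.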
Boltzmann Machines: disadvantage
• The model is badly parameterized for learning high-quality samples: peaked distributions → slow mixing
• Why poor mixing?
(Slide figures: the MNIST dataset and 1st-layer features of an RBM. Good mixing would require coordinated flipping of low-level features.)
Directed graphical models
• Two problems:
  1. Summation over exponentially many states in h
  2. Posterior inference, i.e. calculating p(h | x), is intractable.
$$p(x, h) = p(x \mid h^{(1)})\, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)})\, p(h^{(L)})$$
$$p(x) = \sum_h p(x \mid h)\, p(h)$$
$$\frac{d}{d\theta_i} \log p(x) = \frac{1}{p(x)} \frac{d}{d\theta_i} p(x)$$
(Slide figure: directed model with layers h(3) → h(2) → h(1) → x.)
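A sketch of problem 1 above, using made-up toy probability tables: computing p(x) = Σ_h p(x | h) p(h) exactly requires a sum over every latent state, and the number of states grows as 2^|h| for binary h.

```python
import numpy as np

rng = np.random.default_rng(2)
n_h = 10                                          # 2^10 = 1024 latent states
n_x = 4                                           # x takes one of 4 values
p_h = rng.dirichlet(np.ones(2**n_h))              # prior p(h) over all states
p_x_given_h = rng.dirichlet(np.ones(n_x), size=2**n_h)   # each row is p(x | h)

p_x = p_h @ p_x_given_h                           # exact marginal: 1024 terms
assert abs(p_x.sum() - 1.0) < 1e-9                # a valid distribution over x
# Already 1024 terms here; for |h| in the hundreds the sum is intractable,
# and problem 2 follows: p(h | x) = p(x | h) p(h) / p(x) inherits the cost.
```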
Variational Autoencoder
(Slide diagram: noise → differentiable encoder → sample z from q(z) → differentiable decoder → E[x | z]; x sampled from data feeds the encoder.)
Maximize $\log p(x) - D_{\mathrm{KL}}\left(q(z)\,\|\,p(z \mid x)\right)$
(Kingma and Welling, 2014; Rezende et al., 2014)
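The bound above is usually rearranged into the ELBO, E_q[log p(x | z)] − D_KL(q(z) ‖ p(z)), and for a Gaussian encoder q = N(μ, σ²) with prior p(z) = N(0, I) the KL term has a closed form. A Monte Carlo estimate agrees with it (toy parameter values, my choice):

```python
import numpy as np

def kl_gauss_std_normal(mu, sigma):
    # D_KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(3)
mu, sigma = np.array([0.5, -1.0]), np.array([0.8, 1.2])
z = mu + sigma * rng.standard_normal((200_000, 2))        # z ~ q(z)
# Monte Carlo estimate of the same KL: E_q[log q(z) - log p(z)]
log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)
assert abs(np.mean(log_q - log_p) - kl_gauss_std_normal(mu, sigma)) < 0.05
```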
Generative stochastic networks
• General strategy: Do not write a formula for p(x), just learn to sample incrementally.
• Main issue: Subject to some of the same constraints on mixing as undirected graphical models.
(Bengio et al., 2013)
Generative adversarial networks
• Don’t write a formula for p(x), just learn to sample directly.
• No Markov Chain
• No variational bound
• How? By playing a game.
Game theory: the basics
• N > 1 players
• Clearly defined set of actions each player can take
• Clearly defined relationship between actions and outcomes
• Clearly defined value of each outcome
• Can’t control the other player’s actions
Two-player zero-sum game
• Your winnings + your opponent’s winnings = 0
• Minimax theorem: a rational strategy exists for all such finite games
• Strategy: specification of which moves you make in which circumstances.
• Equilibrium: each player’s strategy is a best response to the opponent’s strategy.
• Example: Rock-paper-scissors:
- Mixed strategy equilibrium
- Choose your action at random
Payoff to you (rows: your move; columns: your opponent’s move):

          Rock  Paper  Scissors
Rock        0    -1      1
Paper       1     0     -1
Scissors   -1     1      0
Two-player zero-sum game
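The payoff matrix above can be checked directly: against a uniformly random opponent every pure strategy earns the same payoff (zero), so no deviation helps either player, and uniform-vs-uniform is the mixed-strategy equilibrium.

```python
import numpy as np

# Rows: your move; columns: opponent's move (Rock, Paper, Scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

uniform = np.ones(3) / 3                 # the mixed strategy: play at random
assert np.allclose(A @ uniform, 0.0)     # your payoffs vs. a random opponent
assert np.allclose(uniform @ A, 0.0)     # symmetrically for the opponent
assert np.isclose(uniform @ A @ uniform, 0.0)   # the game's value is 0
```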
Adversarial nets framework
• A game between two players:
  1. Discriminator D
  2. Generator G
• D tries to discriminate between:
  - A sample from the data distribution.
  - And a sample from the generator G.
• G tries to “trick” D by generating samples that are hard for D to distinguish from data.
Adversarial nets framework
(Slide diagram: input noise z → differentiable function G → x sampled from model → differentiable function D, which tries to output 0; in parallel, x sampled from data → differentiable function D, which tries to output 1.)
Zero-sum game

• Minimax value function:
In other words, D and G play the following two-player minimax game with value function V(G, D):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (1)$$
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) p_x from those of the generative distribution p_g (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution p_g on transformed samples. G contracts in regions of high density and expands in regions of low density of p_g. (a) Consider an adversarial pair near convergence: p_g is similar to p_data and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D*(x) = p_data(x) / (p_data(x) + p_g(x)). (c) After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because p_g = p_data. The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
4 Theoretical Results

The generator G implicitly defines a probability distribution p_g as the distribution of the samples G(z) obtained when z ~ p_z. Therefore, we would like Algorithm 1 to converge to a good estimator of p_data, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for p_g = p_data. We will then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.
(Slide annotations on Eq. 1: the generator pushes the value down, the discriminator pushes it up. The first term is the discriminator’s ability to recognize data as being real; the second term is its ability to recognize generator samples as being fake.)
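A quick Monte Carlo sanity check on V(D, G) (my example, not from the slides): at the equilibrium, where p_g = p_data and D(x) = 1/2 everywhere, the value is log(1/2) + log(1/2) = −log 4, independent of the particular distributions chosen.

```python
import numpy as np

rng = np.random.default_rng(4)
x_data = rng.normal(size=100_000)        # x ~ p_data (arbitrary choice)
x_gen = rng.normal(size=100_000)         # samples G(z), with p_g = p_data

def D(x):
    return np.full_like(x, 0.5)          # the equilibrium discriminator

V = np.mean(np.log(D(x_data))) + np.mean(np.log(1.0 - D(x_gen)))
assert np.isclose(V, -np.log(4.0))
```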
Discriminator strategy
• Optimal strategy for any pmodel(x) is always
$$D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}$$
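A numeric illustration of the optimal discriminator formula, using two made-up Gaussians for p_data and p_model: D* crosses 1/2 exactly where the two densities are equal, and when the model matches the data, D* is 1/2 everywhere.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-5, 5, 2001)
p_data = gauss(x, -1.0, 1.0)
p_model = gauss(x, 1.0, 1.0)
D_star = p_data / (p_data + p_model)     # D*(x) = p_data / (p_data + p_model)

i = np.argmin(np.abs(x))                 # grid point closest to x = 0
assert np.isclose(D_star[i], 0.5)        # densities cross at x = 0, so D* = 1/2
# When the model matches the data exactly, D* is 1/2 everywhere:
assert np.allclose(p_data / (p_data + p_data), 0.5)
```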
Learning process
(Slide sequence over Figure 1 panels: poorly fit model → after updating D → after updating G → mixed strategy equilibrium; curves show the data distribution, the model distribution, and D(x).)
Theoretical properties
• Theoretical properties (assuming infinite data, infinite model capacity, direct updating of generator’s distribution):
  - Unique global optimum.
  - Optimum corresponds to data distribution.
  - Convergence to optimum guaranteed.
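The global-optimum claim can be checked numerically (toy discrete distributions, my choice): with the optimal D substituted in, the generator minimizes C(G) = −log 4 + 2·JSD(p_data ‖ p_g), which is smallest exactly when p_g = p_data, and that minimizer is unique because the Jensen-Shannon divergence is zero only for identical distributions.

```python
import numpy as np

def two_jsd(p, q):
    # 2 * JSD(p || q) = KL(p || m) + KL(q || m), with m the mixture (p + q)/2
    m = 0.5 * (p + q)
    return np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m))

p_data = np.array([0.2, 0.5, 0.3])
assert np.isclose(two_jsd(p_data, p_data), 0.0)            # optimum: JSD = 0
assert two_jsd(p_data, np.array([0.4, 0.4, 0.2])) > 0.0    # any mismatch costs
# At the optimum, C(G) = -log 4 + 2 * JSD reaches its minimum value -log 4:
assert np.isclose(-np.log(4) + two_jsd(p_data, p_data), -np.log(4))
```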
In other words, D and G play the following two-player minimax game with value function V (G,D):
min
G
max
D
V (D,G) = Ex⇠pdata(x)[logD(x)] + E
z⇠pz(z)[log(1�D(G(z)))]. (1)
In the next section, we present a theoretical analysis of adversarial nets, essentially showing thatthe training criterion allows one to recover the data generating distribution as G and D are givenenough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogicalexplanation of the approach. In practice, we must implement the game using an iterative, numericalapproach. Optimizing D to completion in the inner loop of training is computationally prohibitive,and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizingD and one step of optimizing G. This results in D being maintained near its optimal solution, solong as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29]training maintains samples from a Markov chain from one learning step to the next in order to avoidburning in a Markov chain as part of the inner loop of learning. The procedure is formally presentedin Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training G to minimize $\log(1 - D(G(z)))$ we can train G to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
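To see the saturation concretely, write $D = \sigma(a)$ for the discriminator's logit $a$ on a generated sample and compare the gradient each objective sends back through $a$. A minimal sketch (the specific numbers are illustrative):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Let a be the discriminator's logit on a generated sample, so D(G(z)) = sigmoid(a).
# Gradient of each generator objective with respect to a:
#   minimize log(1 - D):  d/da log(1 - sigmoid(a)) = -sigmoid(a)
#   maximize log D:       d/da log sigmoid(a)      =  1 - sigmoid(a)
a = -6.0                        # D confidently rejects the sample: D(G(z)) ~= 0.0025
d = sigmoid(a)
grad_saturating = -d            # ~= -0.0025: vanishing learning signal for G
grad_non_saturating = 1.0 - d   # ~=  0.9975: strong learning signal for G
```

Both gradients vanish only at the same point ($D = 1$ for the saturating form never occurs at equilibrium; both are balanced around $D = \tfrac{1}{2}$), which is why swapping objectives changes the early dynamics but not the fixed point.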
. . .

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_x$ from those of the generative distribution $p_g$ ($G$) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{\text{data}}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. (c) After an update to $G$, gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{\text{data}}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.
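The closed-form optimal discriminator from Figure 1(b) can be checked numerically. A small sketch under the assumption of 1-D Gaussian densities (the function names are illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def d_star(x, p_data, p_g):
    # Optimal discriminator for a fixed G: D*(x) = p_data(x) / (p_data(x) + p_g(x))
    return p_data(x) / (p_data(x) + p_g(x))

p_data = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_g_off = lambda x: gaussian_pdf(x, 2.0, 1.0)  # generator mismatched with the data
p_g_eq = lambda x: gaussian_pdf(x, 0.0, 1.0)   # generator equal to the data

print(d_star(-1.0, p_data, p_g_off))  # ~0.98: confident "real" where p_g puts little mass
print(d_star(-1.0, p_data, p_g_eq))   # 0.5: indistinguishable once p_g = p_data
```

When the generator matches the data, the density ratio is 1 everywhere, so $D^*$ collapses to $\tfrac{1}{2}$, exactly the equilibrium shown in panel (d).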
4 Theoretical Results
The generator $G$ implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{\text{data}}$, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for $p_g = p_{\text{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.
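The global-optimum claim can also be checked numerically: section 4.1 of the paper shows that at the optimal discriminator the generator's criterion is $C(G) = \max_D V(D, G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$, minimized exactly when $p_g = p_{\text{data}}$ with value $-\log 4$. A sketch under the assumption of 1-D unit-variance Gaussians (the quadrature setup is illustrative):

```python
import math

def gauss(x, mu):
    return math.exp(-((x - mu) ** 2) / 2.0) / math.sqrt(2.0 * math.pi)

def value_at_optimal_d(mu_g, n=4000, lo=-12.0, hi=12.0):
    # C(G) = E_pdata[log D*] + E_pg[log(1 - D*)] by midpoint quadrature,
    # with p_data = N(0, 1) and p_g = N(mu_g, 1). Logs are taken on the raw
    # densities to avoid rounding D* to exactly 1 in the tails.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        pd, pg = gauss(x, 0.0), gauss(x, mu_g)
        log_mix = math.log(pd + pg)
        total += h * (pd * (math.log(pd) - log_mix) + pg * (math.log(pg) - log_mix))
    return total

print(value_at_optimal_d(0.0))  # ~ -log 4 ~ -1.386: the global optimum, p_g = p_data
print(value_at_optimal_d(3.0))  # larger: a mismatched generator pays a JSD penalty
```

Any mismatch raises $C(G)$ above $-\log 4$ by twice the Jensen-Shannon divergence, so minimizing the criterion drives the generator onto the data distribution.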
In practice: no proof that SGD converges
Oscillation
(Alec Radford)
Visualization of model samples
MNIST TFD
CIFAR-10 (fully connected) CIFAR-10 (convolutional)
Learned 2-D manifold of MNIST
1. Draw sample (A)
2. Draw sample (B)
3. Simulate samples along the path between A and B
4. Repeat steps 1-3 as desired.
Visualizing trajectories
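The four steps can be sketched with a stand-in generator (the function `G` below is purely illustrative, not a trained network):

```python
def G(z):
    # Stand-in for a trained generator network mapping a latent z to a sample.
    return (2.0 * z + 1.0, -z)

def interpolate(z_a, z_b, n_steps=9):
    # Steps 1-3: decode samples along the straight latent-space line from A to B.
    return [G(z_a + (z_b - z_a) * t / (n_steps - 1)) for t in range(n_steps)]

path = interpolate(-1.0, 1.0)
assert path[0] == G(-1.0) and path[-1] == G(1.0)  # endpoints are samples A and B
```

With a real generator, the interior points of `path` show one sample smoothly morphing into the other, which is what the trajectory visualizations on the following slides display.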
Visualization of model trajectories
MNIST digit dataset Toronto Face Dataset (TFD)
CIFAR-10 (convolutional)
Visualization of model trajectories
GANs vs VAEs

• Both use backprop through continuous random number generation
• VAE:
- generator gets direct output target
- need REINFORCE to do discrete latent variables
- possible underfitting due to variational approximation
- gets global image composition right but blurs details
• GAN:
- generator never sees the data
- need REINFORCE to do discrete visible variables
- possible underfitting due to non-convergence
- gets local image features right but not global structure
VAE + GAN
(Alec Radford, 2015)
VAE VAE+GAN
- Reduce VAE blurriness
- Reduce GAN oscillation
MMD-based generator nets
(Li et al 2015) (Dziugaite et al 2015)
Supervised Generator Nets
(Dosovitskiy et al 2014)
Generator nets are powerful; it is our ability to infer a mapping from an unobserved space that is limited.
General game
Extensions
• Inference net:
- Learn a network to model p(z | x)
- Wake/Sleep style approach
- Sample z from the prior
- Sample x from p(x | z)
- Learn the mapping from x to z
- Infinite training set!
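A minimal sketch of the sleep-style idea, with a toy affine generator and a linear "inference net" fit by least squares (the affine G and the closed-form fit are stand-ins for the neural networks involved):

```python
import random

random.seed(1)

theta = 4.0
def G(z):
    # Toy affine generator standing in for a trained generator net.
    return z + theta

# Sleep-phase data: the generator itself supplies unlimited (z, x) pairs.
pairs = [(z, G(z)) for z in (random.gauss(0.0, 1.0) for _ in range(2000))]

# Fit the inference model z ~ slope * x + intercept by closed-form
# 1-D least squares (a linear stand-in for the inference network).
n = len(pairs)
mean_z = sum(z for z, _ in pairs) / n
mean_x = sum(x for _, x in pairs) / n
cov = sum((z - mean_z) * (x - mean_x) for z, x in pairs) / n
var = sum((x - mean_x) ** 2 for _, x in pairs) / n
slope = cov / var
intercept = mean_z - slope * mean_x

def infer(x):
    # Learned mapping from a sample x back to the latent z that produced it.
    return slope * x + intercept

print(infer(G(0.7)))  # recovers the latent 0.7 (up to float error)
```

Because every training pair is generated on the fly, the inference net never reuses data, which is the "infinite training set" point above.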
Extensions
• Conditional model:
- Learn p(x | y)
- Discriminator is trained on (x, y) pairs
- Generator net gets y and z as input
- Useful for: translation, speech synthesis, image segmentation
(Mirza and Osindero, 2014)
Laplacian Pyramid
(Denton + Chintala, et al 2015)
LAPGAN results

• 40% of samples mistaken by humans for real photos
(Denton + Chintala, et al 2015)
Open problems
• Is non-convergence a serious problem in practice?
• If so, how can we prevent non-convergence?
• Is there a better loss function for the generator?
Thank You.
Questions?