Adversarial Learning
TRANSCRIPT

Adversarial Learning
- What is a GAN?
- Some mathematical background
- Algorithms for training a GAN
- The Wasserstein GAN
- Conditional GANs
How do we sample from a distribution?

Goal: generate Y ∼ p_y(y).

Parametric methods:
- Assume some known parametric distribution, e.g., p_y(y) = (1/Z) exp(−u(y))
- Markov chain Monte Carlo (MCMC) methods, e.g., Metropolis-Hastings sampling
- Requires a known distribution

Non-parametric methods:
- Provided with samples y_0, …, y_{K−1}
- Infer the distribution from the samples
- Use a generator: Y = h_θ(Z), where Z ∼ N(0, I) is the source of randomness and Y is a random vector with the desired distribution

[Figure: a multivariate random vector in ℜ^d and its multivariate density; example from https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html]
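The generator idea can be illustrated with the classical inverse-CDF transform: a hand-designed h(Z) for a known target distribution. This toy sketch (an illustration, not from the slides) draws exponential samples from uniform noise:

```python
import math
import random

def generator(z, rate=2.0):
    """Hand-designed h(z): inverse CDF of an Exponential(rate) distribution.

    Maps uniform noise z in (0, 1) to a sample with the desired distribution;
    a learned generator h_theta(z) plays exactly this role, with theta trained
    rather than derived in closed form."""
    return -math.log(1.0 - z) / rate

random.seed(0)
samples = [generator(random.random()) for _ in range(100_000)]

# The sample mean should approach the exponential mean 1/rate = 0.5.
mean = sum(samples) / len(samples)
print(round(mean, 2))  # → 0.5
```

The same architecture applies when no closed-form inverse CDF exists: replace `generator` with a neural network and learn its parameters, which is precisely what the GAN training below does.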
Training a Generator

Function of a GAN:
- Generates samples, ŷ_k, with the same distribution as y_k.
- Can use drop-outs to generate randomness.

Training algorithm:
- Compare the distributions of ŷ_k and y_k.
- Feed back parameter corrections θ.

[Figure: block diagram — an independent noise source z_k drives the generator ŷ_k = h_θ(z_k); a training algorithm compares the generated samples ŷ_k with the reference samples y_k and updates θ.]
The Bayes Discriminator

What is the probability that an observation, y, is real?

Use Bayes rule:

P(Class = R | y) = P(y | Class = R) P(Class = R) / [ P(y | Class = R) P(Class = R) + P(y | Class = F) P(Class = F) ]

Assuming P(y | Class = F) = p_θg(y), P(y | Class = R) = p_r(y), and P(Class = R) = P(Class = F) = 1/2, we have that

P(Class = R | y) = p_r(y)(1/2) / [ p_r(y)(1/2) + p_θg(y)(1/2) ] = 1 / (1 + R_θg(y))

where R_θg(y) = p_θg(y) / p_r(y) is the likelihood ratio.

[Figure: the generator ŷ_k = h_θg(z_k) produces "F - Fake" samples from an independent noise source z_k; the reference samples y_k are "R - Real".]
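When both densities are known, the posterior 1/(1 + R(y)) can be evaluated directly. As an illustrative sketch (not from the slides), here are two unit-variance Gaussians playing the roles of the real and fake class-conditional densities:

```python
import math

def gauss_pdf(y, mu):
    """Unit-variance Gaussian density N(mu, 1)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def bayes_discriminator(y, mu_real=0.0, mu_fake=4.0):
    """P(Class = R | y) = 1 / (1 + R(y)) with R(y) = p_fake(y) / p_real(y),
    assuming equal priors P(R) = P(F) = 1/2."""
    ratio = gauss_pdf(y, mu_fake) / gauss_pdf(y, mu_real)
    return 1.0 / (1.0 + ratio)

print(round(bayes_discriminator(0.0), 3))  # near the real mode: close to 1
print(round(bayes_discriminator(4.0), 3))  # near the fake mode: close to 0
print(round(bayes_discriminator(2.0), 3))  # midpoint: exactly 0.5
```

The discriminator network trained below is an approximation of exactly this posterior when the densities are unknown.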
Implementing a Bayes Discriminator

Bayesian discriminator:

f_θd(y) ≈ P(Class = R | y)
        = p_r(y) / (p_r(y) + p_θg(y))
        = 1 / (1 + R_θg(y))

where

R_θg(y) = p_θg(y) / p_r(y)

is the likelihood ratio.

[Figure: block diagram — the discriminator scores the generated samples, p̂_k = f_θd(ŷ_k), which should be mostly 0s, and the reference samples, p_k = f_θd(y_k), which should be mostly 1s.]
Bayes Discriminator and the Likelihood Ratio

Bayesian discriminator:

f_θd(y) ≈ P(Class = R | y)
        = p_r(y) / (p_r(y) + p_θg(y))
        = 1 / (1 + R_θg(y))

[Figure: the generated distribution p_θg(y) and the reference distribution p_r(y) plotted versus y, together with the likelihood ratio R_θg(y); where R_θg(y) < 1 the discriminator classifies y as "real", and where R_θg(y) > 1 it classifies y as "fake".]
Training a Bayes Discriminator

Discriminator loss function:

d̂(θg, θd) = (1/K) Σ_{k=0}^{K−1} [ −log f_θd(y_k) − log(1 − f_θd(ŷ_k)) ]

Optimal discriminator parameter:

θd* = argmin_θd d̂(θg, θd)

- Results in the ML estimate of the Bayes classifier parameters.

[Figure: block diagram — generated samples ŷ_k = h_θg(z_k) are scored as p̂_k = f_θd(ŷ_k) and compared against the label 0 = Generated with CrossEntropy(p̂_k, 0); reference samples y_k are scored as p_k = f_θd(y_k) and compared against the label 1 = Reference with CrossEntropy(p_k, 1); the two terms sum to d̂(θg, θd).]
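The empirical loss d̂ is just the average of two cross-entropy terms. A sketch (the toy scores are illustrative, not from the slides):

```python
import math

def discriminator_loss(p_real, p_fake):
    """Empirical loss d_hat = (1/K) sum_k [ -log f(y_k) - log(1 - f(y_hat_k)) ].

    p_real[k] = f(y_k) are scores on reference samples (should be near 1);
    p_fake[k] = f(y_hat_k) are scores on generated samples (should be near 0)."""
    K = len(p_real)
    return sum(-math.log(pr) - math.log(1.0 - pf)
               for pr, pf in zip(p_real, p_fake)) / K

# A confident, correct discriminator achieves a small loss...
good = discriminator_loss([0.9, 0.95, 0.99], [0.1, 0.05, 0.01])
# ...while a perfectly confused one (all scores 1/2) is stuck at 2*log(2).
confused = discriminator_loss([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(round(good, 3), round(confused, 3))
```

Minimizing this loss over θd is ordinary binary logistic-regression training with labels 1 for reference and 0 for generated samples.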
Training a Generator

Big idea: Maximize the probability that outputs of the generator are classified as being from the reference distribution.
- p̂_k should be large.
- L(p̂_k) should be small when p̂_k is large.
- L(p̂_k) should be a decreasing function of p̂_k.

Generator loss function:

ĝ(θg, θd) = (1/K) Σ_{k=0}^{K−1} L(f_θd(ŷ_k))

Optimal generator parameter:

θg* = argmin_θg ĝ(θg, θd)

[Figure: block diagram — noise z_k drives the generator ŷ_k = h_θg(z_k); the discriminator outputs p̂_k = f_θd(ŷ_k), which feeds the loss function L(p̂_k) to form ĝ(θg, θd); the loss should encourage p̂_k to be large.]
Generative Adversarial Network (GAN)*

Generator loss function:

ĝ(θg, θd) = (1/K) Σ_{k=0}^{K−1} L(f_θd(ŷ_k))

Discriminator loss function:

d̂(θg, θd) = (1/K) Σ_{k=0}^{K−1} [ −log f_θd(y_k) − log(1 − f_θd(ŷ_k)) ]

[Figure: the full GAN block diagram — noise z_k drives the generator ŷ_k = h_θg(z_k); the discriminator scores generated samples, p̂_k = f_θd(ŷ_k), and reference samples, p_k = f_θd(y_k); CrossEntropy(p̂_k, 0) and CrossEntropy(p_k, 1) sum to the discriminator loss d̂(θg, θd), while L(p̂_k) forms the generator loss ĝ(θg, θd).]

*Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Networks", Proc. of the International Conference on Neural Information Processing Systems (NIPS 2014).
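The two losses above can be alternated on a toy problem small enough to run without any ML library. This is a minimal, self-contained sketch of the GAN idea, not the paper's implementation: the reference distribution is N(2, 1), the generator is a pure location shift h_θ(z) = θ + z, the discriminator is a hand-rolled logistic unit, and all hyperparameters are illustrative.

```python
import math
import random

random.seed(1)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Reference distribution: N(2, 1).  Toy generator: y_hat = h_theta(z) = theta + z,
# so the only generator parameter is the location shift theta.
MU_REAL, BATCH = 2.0, 200
theta = -2.0          # generator parameter theta_g
a, b = 0.0, 0.0       # discriminator f_theta_d(y) = sigmoid(a*y + b)

for it in range(1500):
    # --- Discriminator: a few inner steps on the cross-entropy loss d_hat ---
    for _ in range(5):
        real = [random.gauss(MU_REAL, 1.0) for _ in range(BATCH)]
        fake = [theta + random.gauss(0.0, 1.0) for _ in range(BATCH)]
        ga = gb = 0.0
        for y in real:                     # -log f(y) term
            s = sigmoid(a * y + b)
            ga += (s - 1.0) * y / BATCH
            gb += (s - 1.0) / BATCH
        for y in fake:                     # -log(1 - f(y_hat)) term
            s = sigmoid(a * y + b)
            ga += s * y / BATCH
            gb += s / BATCH
        a -= 0.3 * ga
        b -= 0.3 * gb

    # --- Generator: one step on the non-saturating loss -log f(h_theta(z)) ---
    fake = [theta + random.gauss(0.0, 1.0) for _ in range(BATCH)]
    gt = sum((sigmoid(a * y + b) - 1.0) * a for y in fake) / BATCH
    theta -= 0.05 * gt

print(round(theta, 1))  # the generator mean should approach the reference mean 2
```

The gradients are written out by hand because both networks are one-parameter-deep; in practice autodiff does this bookkeeping, but the alternating structure is the same.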
GAN: Expected Loss Functions

Generator loss function:

g(θg, θd) = E[ L(f_θd(Ŷ_θg)) ]

Discriminator loss function:

d(θg, θd) = E[ −log f_θd(Y) ] + E[ −log(1 − f_θd(Ŷ_θg)) ]

- By the weak and strong laws of large numbers, lim_{K→∞} ĝ = g and lim_{K→∞} d̂ = d.

[Figure: the same GAN block diagram with random variables in place of samples — Z drives the generator Ŷ_θg = h_θg(Z); the discriminator outputs P̂_θg = f_θd(Ŷ_θg) and P = f_θd(Y); CrossEntropy(P̂, 0) and CrossEntropy(P, 1) form d(θg, θd), and L(P̂_θg) forms g(θg, θd).]
Generator Loss Function Choices

Option 0: Original loss function proposed in [1].
- L(p) = log(1 − p);  L(0) = 0, L(1) = −∞
- g(θg, θd) = E[ log(1 − f_θd(Ŷ_θg)) ]
- Presented as the key theoretically grounded approach in the Goodfellow paper.
- Consistent with zero-sum-game Nash equilibrium theory.
- Almost no one uses it.

Option 1: "Non-saturating" loss function, i.e., the "−log D trick".
- L(p) = −log p;  L(0) = +∞, L(1) = 0
- g(θg, θd) = E[ −log f_θd(Ŷ_θg) ]
- Mentioned as a trick in [1] to keep the training loss from "saturating".
- This is what you get if you use cross-entropy loss for the generator.
- This is what is commonly done.

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Networks", Proc. of the International Conference on Neural Information Processing Systems (NIPS 2014).
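The saturation can be seen numerically. As a sketch (not from the slides), compare the gradient magnitudes |L′(p)| of the two options when the discriminator confidently rejects a generated sample, i.e., when p is near 0:

```python
def grad_saturating(p):
    """d/dp of log(1 - p): gradient of the original (Option 0) generator loss."""
    return -1.0 / (1.0 - p)

def grad_non_saturating(p):
    """d/dp of -log p: gradient of the non-saturating (Option 1) loss."""
    return -1.0 / p

# Early in training the discriminator easily rejects fakes, so p is tiny.
p = 1e-3
print(abs(grad_saturating(p)))      # ≈ 1: almost no learning signal
print(abs(grad_non_saturating(p)))  # ≈ 1000: strong learning signal
```

This is exactly the regime where training starts, which is why Option 1 trains a weak generator much faster than Option 0.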
GAN Architecture (Non-Saturating)

Generator loss function:

g(θg, θd) = E[ −log f_θd(Ŷ_θg) ]

Discriminator loss function:

d(θg, θd) = E[ −log f_θd(Y) ] + E[ −log(1 − f_θd(Ŷ_θg)) ]

- By the weak and strong laws of large numbers, lim_{K→∞} ĝ = g and lim_{K→∞} d̂ = d.

[Figure: the GAN block diagram with the generator loss realized as CrossEntropy(P̂, 1), i.e., the generated samples are scored against the "1 = Reference" label.]
GAN Equilibrium Conditions (Non-Saturating)

We would like to find the solution to:

θg* = argmin_θg g(θg, θd*) = argmin_θg E[ −log f_θd*(Ŷ_θg) ]
θd* = argmin_θd d(θg*, θd) = argmin_θd { E[ −log f_θd(Y) ] + E[ −log(1 − f_θd(Ŷ_θg*)) ] }

- This is known as a Nash equilibrium.
- We would like it to converge to R* = 1 ⇒ (generated = reference distributions).

How do we solve this?
Will it converge?
Nash Equilibrium with Two Agents*

Agent G:
- Controls parameter θg.
- Goal is to minimize g(θg, θd*).

Agent D:
- Controls parameter θd.
- Goal is to minimize d(θg*, θd).

Each agent tries to minimize its own meter:

g(θg*, θd*) = min_θg g(θg, θd*)
d(θg*, θd*) = min_θd d(θg*, θd)

[Figure: two stick-figure agents, each turning its own knob (θg for G, θd for D) to minimize the reading on its own meter.]

*Graphics and art reproduced from the "stick figure" Wiki page.
Zero-Sum Game: A Special Nash Equilibrium*

Special case when g(θg, θd) = −d(θg, θd):

Agent G:
- Goal is to minimize g(θg, θd*), i.e., to maximize d(θg, θd*).

Agent D:
- Goal is to minimize d(θg*, θd).

This is an adversarial relationship:

d(θg*, θd*) = max_θg d(θg, θd*)
d(θg*, θd*) = min_θd d(θg*, θd)

[Figure: the same two stick-figure agents, but now G turns its knob θg to maximize the shared meter d(θg, θd) while D turns θd to minimize it.]

*Graphics and art reproduced from the "stick figure" Wiki page.
Computing the GAN Equilibrium
- Reparameterizing the equations
- Alternating minimization ⇒ mode collapse
- Generator loss gradient descent
- Practical convergence issues
Reparameterize Loss Functions

Goal: Replace θg and θd with R and f.

Generator parameter R:
- Generated samples are Ŷ ∼ R(y) p_r(y) = p_θg(y), where R(y) = p_θg(y) / p_r(y).

Discriminator parameter f:
- The discriminator is f(y) = P(Class = R | y).

Important facts:
- E[R(Y)] = 1
- Ω_g = { R: ℜ^d → [0, ∞) such that E[R(Y)] = 1 }
- Ω_d = { f: ℜ^d → [0, 1] }
- For any function h(y), E[h(Ŷ)] = E[h(Y) R(Y)]
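The last fact is just importance weighting. As a sketch with two known Gaussians (an illustration, not from the slides), here is a Monte Carlo check that E[h(Ŷ)] = E[h(Y) R(Y)]:

```python
import math
import random

random.seed(0)

def ratio(y):
    """Likelihood ratio R(y) = p_g(y) / p_r(y) for p_r = N(0,1), p_g = N(1,1).

    The Gaussian exponents simplify to exp(y - 1/2)."""
    return math.exp(y - 0.5)

h = lambda y: y * y
N = 200_000

# Direct estimate of E[h(Y_hat)] with Y_hat drawn from p_g = N(1, 1).
direct = sum(h(random.gauss(1.0, 1.0)) for _ in range(N)) / N

# Reweighted estimate E[h(Y) R(Y)] with Y drawn from p_r = N(0, 1).
weighted = sum(h(y) * ratio(y)
               for y in (random.gauss(0.0, 1.0) for _ in range(N))) / N

print(round(direct, 2), round(weighted, 2))  # both ≈ E[Y_hat^2] = 2
```

This identity is what lets the reparameterized losses below replace expectations over the generated distribution with R(Y)-weighted expectations over the reference distribution.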
GAN Equilibrium Conditions

We would like to find the solution to:

R* = argmin_{R ∈ Ω_g} g(R, f*)
f* = argmin_{f ∈ Ω_d} d(R*, f)

- This is known as a Nash equilibrium.
- We would like it to converge to R* = 1 ⇒ (generated = reference distributions).

How do we do this?
Will it converge?
Reparameterized Loss Functions

Generator loss function:

g(R, f) = E[ −log f(Ŷ) ] = E[ −R(Y) log f(Y) ]

Discriminator loss function:

d(R, f) = E[ −log f(Y) ] + E[ −log(1 − f(Ŷ)) ]
        = E[ −log f(Y) ] + E[ −R(Y) log(1 − f(Y)) ]

Nash equilibrium:

R* = argmin_{R ∈ Ω_g} g(R, f*)
f* = argmin_{f ∈ Ω_d} d(R*, f)
Method 1: Alternating Minimization

Algorithm:
Repeat {
    f* ← argmin_{f ∈ Ω_d} d(R*, f)
    R* ← argmin_{R ∈ Ω_g} g(R, f*)
}

Doesn't work!

Discriminator update:

f*(y) ← 1 / (1 + R(y))

Generator update:

R*(y) ← δ(y − y′), where y′ = argmax_y f(y)

Problem:
- This is called "mode collapse".
- The generator only produces the single sample that the discriminator likes best.
- Intuition: "We come from France." "I like cheese steaks." "Too good to be true." "Too creepy to be real."
Method 2: Generator Loss Gradient Descent (GLGD)

Algorithm:
Repeat {
    f* ← argmin_{f ∈ Ω_d} d(R, f)
    R ← R − α P ∇_R g(R, f*)      (gradient descent step; P is the projection onto the valid parameter space)
}

Discriminator update:

f*(y) ← 1 / (1 + R(y))

Generator update:
- Take a step in the negative direction of the generator loss gradient.
- P projects into the allowed parameter space Ω_g (this is not an issue in practice):

P d(y) = d(y) − [ E[d(Y)] / E[p_r(Y)] ] p_r(y)

Questions/Comments:
- "GLGD" is my term for this method.
- It can be applied with a wide variety of generator/discriminator loss functions.
- Does this converge?
- If so, then what (if anything) is being minimized?

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
GLGD Convergence for Non-Saturating GAN

For the non-saturating GAN, when f* = argmin_{f ∈ Ω_d} d(R, f), it can be shown that

∇_R g(R, f*) = ∇_R C(R)

where*

C(R) = E[ (1 + R(Y)) log(1 + R(Y)) ]

Algorithm:
Repeat {
    f* ← argmin_{f ∈ Ω_d} d(R, f)
    R ← R − α P ∇_R g(R, f*)
}

Conclusions:
- GLGD is really a gradient descent algorithm for the cost function C(R).
- (1 + x) log(1 + x) is a strictly convex function of x.
- Therefore, we know that C(R) has a unique global minimum at R(Y) = 1.
- However, convergence tends to be slow.

*This is an equivalent expression to Theorem 2.5 of [1].
[1] Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
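As a quick numeric sketch (not from the slides), the per-sample cost c(x) = (1 + x) log(1 + x) is strictly convex, and together with the constraint E[R(Y)] = 1 this pins the minimizer of C(R) at R ≡ 1:

```python
import math

def c(x):
    """Per-sample cost (1 + x) log(1 + x); the slide's C(R) is E[c(R(Y))]."""
    return (1.0 + x) * math.log(1.0 + x)

# Strict convexity: the midpoint value lies strictly below the chord.
assert c(0.5) < 0.5 * (c(0.0) + c(1.0))

# Under the constraint E[R(Y)] = 1, Jensen's inequality gives
# C(R) = E[c(R)] >= c(E[R]) = c(1) = 2 log 2, with equality only for R = 1.
spread = [0.5, 1.5]                      # a ratio R taking two values with mean 1
cost_spread = sum(c(x) for x in spread) / 2
print(round(cost_spread, 3), round(c(1.0), 3))  # spread cost exceeds 2 log 2 ≈ 1.386
```

So any R that deviates from 1 while keeping mean 1 strictly increases the cost, which is the uniqueness claim on the slide.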
More Details on GLGD

Arjovsky and Bottou showed that for the non-saturating GAN*

P ∇_R g(R, f*) = ∇_R [ KL(p_R || p_r) − 2 JSD(p_R || p_r) ]

So from previous identities, we have that:

P ∇_R g(R, f*) = ∇_R [ 2 KL( (p_R + p_r)/2 || p_r ) ]
               = P ∇_R E[ (1 + R(Y)) log(1 + R(Y)) ]

Algorithm:
Repeat {
    f* ← argmin_{f ∈ Ω_d} d(R, f)
    R ← R − α P ∇_R g(R, f*)
}

Conclusions:
- GLGD is really a gradient descent algorithm for the cost function C(R) = 2 KL( (p_R + p_r)/2 || p_r ).
- C(R) has a unique global minimum at p_R = p_r.
- However, convergence tends to be slow.

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
Convergence of GANs

Generator and discriminator at convergence:
- Reference and generated distribution are the same ⇒ R*(Y) = 1.
- Discriminator cannot distinguish the distributions ⇒ f*(y) = 1/(1 + R*(y)) = 1/2.

At convergence the generated and reference distributions are identical:
- Therefore, the likelihood ratio is R(y) = 1;
- The generated (fake) and reference (real) distributions are identical;
- The discriminator assigns a 50/50 probability to either case because they are indistinguishable.
- Then both the generator and discriminator cross-entropy losses are −log(1/2) ≈ 0.693.

In practice, things don't usually work out this well…
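The 0.693 figure is easy to verify. As a one-line numeric sanity check (a sketch, not from the slides), here is the cross-entropy of a perfectly confused discriminator:

```python
import math

# At equilibrium f*(y) = 1/2, so every per-sample cross-entropy term is -log(1/2).
f_star = 0.5
gen_loss = -math.log(f_star)            # generator term:      -log f(y_hat)
disc_real = -math.log(f_star)           # discriminator term:  -log f(y)
disc_fake = -math.log(1.0 - f_star)     # discriminator term:  -log(1 - f(y_hat))

print(round(gen_loss, 3))  # → 0.693
```

This is the flat-line value to look for in the training curves that follow: generator loss and half the discriminator loss both settling at log 2.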
Method 2: Practical Algorithm

Algorithm:
Repeat {
    For N_d iterations {
        B ← GetRandomBatch
        θd ← θd − β ∇_θd d̂(θg, θd; B)
    }
    B ← GetRandomBatch
    θg ← θg − α ∇_θg ĝ(θg, θd; B)
}

What you would like to see:
[Figure: loss versus iteration number — the generator loss and ½ × the discriminator loss both settle to −log(1/2) ≈ 0.693.]

Looks good, but…
- Could result from a discriminator with insufficient capacity.
Failure Mode: Mode Collapse

Algorithm:
Repeat {
    For i = 0 to N_d − 1 {
        θd ← θd − β ∇_θd d̂(θg, θd)
    }
    θg ← θg − α ∇_θg ĝ(θg, θd)
}

Sometimes you get mode collapse:
[Figure: loss versus iteration number — the discriminator dominates; the generator loss climbs while ½ × the discriminator loss falls below ≈ 0.693.]

Might be caused by:
- Overfitting by the discriminator
- Insufficient number of discriminator updates
- Insufficient generator capacity
Concept: Wasserstein GAN*

Problem with GAN training using the "−log D trick":
- Slow and sometimes unstable convergence
- Problems with vanishing gradients

Conjecture:
- The problem is caused by the discriminator function.
- The Bayes classifier is too sensitive and too nonlinear.
- Non-overlapping distributions create vanishing gradients that slow convergence.

Solution: Base the discriminator on the Wasserstein distance (i.e., the earth mover's distance).

*Martin Arjovsky, Soumith Chintala and Leon Bottou, "Wasserstein Generative Adversarial Networks", ICML 2017.
Wasserstein Fundamentals

Based on the Kantorovich-Rubinstein duality (Villani, 2009)*:

W(p_r || p_R) = sup_{‖f‖_L ≤ 1} { E[f(Y)] − E[f(Ŷ_R)] }

- where ‖f‖_L is the Lipschitz constant of f
- ‖f‖_L ≤ 1 is referred to as the 1-Lipschitz condition

*Villani, Cedric. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
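In one dimension the Wasserstein-1 distance between two equal-size empirical distributions has a closed form: sort both sample sets and average the coordinate-wise gaps. A sketch (an illustration, not from the slides):

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions in 1-D.

    For sorted samples the optimal transport plan matches the i-th smallest
    of one set to the i-th smallest of the other, so W1 reduces to the mean
    absolute difference of the order statistics."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a distribution by c moves it a Wasserstein distance of exactly c,
# even when the supports do not overlap -- unlike KL, which blows up.
ref = [0.0, 0.5, 1.0, 1.5]
gen = [x + 2.0 for x in ref]
print(wasserstein_1d(ref, gen))  # → 2.0
```

The non-overlapping-support case is exactly why this metric gives useful gradients where the Bayes discriminator saturates.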
Wasserstein GAN*

Then the fundamental result of Arjovsky and Bottou:
- Define the set R ∈ Ω_g as usual, but define f ∈ Ω_d^W so that

Ω_d^W = { f: ℜ^d → (−∞, ∞) s.t. ‖f‖_L ≤ 1 }

- Then define the Wasserstein generator and discriminator loss functions as

g_W(R, f) = E[ −f(Ŷ_R) ]
d_W(R, f) = E[ f(Ŷ_R) ] − E[ f(Y) ]

Key result from the Arjovsky paper*:

∇_R g_W(R, f*) = ∇_R W(p_r || p_R)

- where f* = argmin_{f ∈ Ω_d^W} d_W(R, f)

*Martin Arjovsky, Soumith Chintala, and Leon Bottou, "Wasserstein Generative Adversarial Networks", ICML 2017.
Method 3: Wasserstein Algorithm

Algorithm:
Repeat {
    f* ← argmin_{f ∈ Ω_d^W} d_W(R, f)
    R ← R − α P ∇_R g_W(R, f*)
}

Discriminator update:
- How do we solve the problem of minimizing the discriminator loss with the Lipschitz constraint?
- Answer: We clip the discriminator weights during training.
- Observation: Isn't this just regularization of the discriminator DNN?

Generator update:
- Take a step in the negative direction of the generator loss gradient.

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
Method 3: Wasserstein Practical Algorithm

Algorithm (annotations on the WGAN training loop):
- Make sure to get new batches.
- Iterate the discriminator to approximate convergence.
- Minimize the discriminator loss.
- Clip the discriminator weights to approximate the Lipschitz constraint.
- Take a gradient step of the generator loss.

Observations:
- Some people seem to feel the Wasserstein GAN has better convergence.
- However, is this because of the Wasserstein metric?
- Or is it because of the other algorithmic improvements?

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
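The weight-clipping step above can be sketched as follows (an illustrative pure-Python helper, not the authors' code); after each critic update, every weight is forced into [−c, c]:

```python
def clip_weights(weights, c=0.01):
    """Clip each critic weight into [-c, c] after a gradient step.

    This crudely bounds the network's Lipschitz constant, approximating the
    1-Lipschitz constraint required by the Kantorovich-Rubinstein duality.
    c = 0.01 is the default clipping value used in the WGAN paper."""
    return [max(-c, min(c, w)) for w in weights]

updated = clip_weights([0.5, -0.02, 0.003, -0.5])
print(updated)  # → [0.01, -0.01, 0.003, -0.01]
```

As the slide's "Observation" hints, this is a blunt instrument: clipping constrains every weight independently rather than the function's true Lipschitz constant, which motivated later alternatives such as gradient penalties.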
Conditional Generative Adversarial Network

- Generates samples from the conditional distribution of Y given X.
- The discriminator takes (y_k, x_k) input pairs for k = 0, …, K − 1.

[Figure: block diagram — the generator now takes both the conditioning input x_k and noise z_k, producing ŷ_k = h_θg(x_k, z_k) (labeled 0 = Generated); the discriminator scores pairs, p̂_k = f_θd(ŷ_k, x_k) for generated samples and p_k = f_θd(y_k, x_k) for reference pairs (x_k, y_k) drawn from the reference distribution (labeled 1 = Reference); the cross-entropy terms form d(θg, θd) and g(θg, θd) as before.]
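As a final sketch (illustrative, not from the slides), the only structural change from an unconditional GAN is that both networks receive x. Here a hand-designed conditional generator h(x, z) draws from Y | X = x ~ N(2x, 1), standing in for the learned h_θg(x_k, z_k):

```python
import random

random.seed(0)

def h(x, z):
    """Hypothetical conditional generator: y_hat = h(x, z) = 2x + z.

    In a trained conditional GAN, h would be a network h_theta_g(x, z) and
    the discriminator f_theta_d(y, x) would score the (y, x) pair."""
    return 2.0 * x + z

# For each conditioning value x, the generated samples follow N(2x, 1):
# a different output distribution per input, which is the whole point.
means = {}
for x in (0.0, 1.0, 3.0):
    samples = [h(x, random.gauss(0.0, 1.0)) for _ in range(50_000)]
    means[x] = sum(samples) / len(samples)
    print(x, round(means[x], 1))  # conditional mean tracks 2x
```

Training proceeds exactly as before, except every batch pairs each y (real or generated) with its conditioning x.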