Deep generative modeling
Jakub M. Tomczak
Deep Learning Researcher (Engineer, Staff)
Qualcomm AI Research
Qualcomm Technologies Netherlands B.V.
May 22, 2019, @UvA Amsterdam
Introduction
Is generative modeling important?

The neural network learns to classify images: p(panda|x) = 0.99, ...

But adding a small amount of noise to the image flips the prediction:
panda image + noise = p(panda|x) = 0.01, ..., p(dog|x) = 0.9

There is no semantic understanding of images.
Is generative modeling important?

New data: a high probability of a horse = a highly probable decision!

But: a high probability of a horse × a low probability of the object itself (under a generative model) = an uncertain decision!
Where do we use generative modeling?
Image analysis, audio analysis, text analysis, graph analysis, medical data, Reinforcement Learning, Active Learning, and more...
Generative modeling: How?
Generative model:
● Autoregressive (e.g., PixelCNN)
● Flow-based (e.g., RealNVP, GLOW)
● Latent variable models:
○ Implicit models (e.g., GANs)
○ Prescribed models (e.g., VAE)
Generative modeling: Pros and cons

| | Training | Likelihood | Sampling | Compression |
|---|---|---|---|---|
| Autoregressive models (e.g., PixelCNN) | Stable | Yes | Slow | No |
| Flow-based models (e.g., RealNVP) | Stable | Yes | Fast/Slow | No |
| Implicit models (e.g., GANs) | Unstable | No | Fast | No |
| Prescribed models (e.g., VAEs) | Stable | Approximate | Fast | Yes |
Machine learning and (spherical) cows

flow-based models vs. latent variable models
Deep latent variable models
Generative modeling

Modeling in high-dimensional spaces is difficult.
➔ Modeling all dependencies among pixels is problematic.

A possible solution: Latent Variable Models!
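A quick back-of-the-envelope count illustrates the problem (an added example, not from the slides; the 28×28 image size is an assumption for illustration):

```python
# Why modeling all pixel dependencies is hard: even a multivariate Gaussian
# with a full covariance matrix -- one of the simplest models that captures
# all pairwise dependencies -- needs hundreds of thousands of parameters
# for a small image, while a fully factorized Gaussian ignores them all.

def full_gaussian_params(dim: int) -> int:
    """Mean (dim) + symmetric covariance (dim * (dim + 1) / 2)."""
    return dim + dim * (dim + 1) // 2

def factorized_gaussian_params(dim: int) -> int:
    """Mean and variance per pixel; all dependencies ignored."""
    return 2 * dim

D = 28 * 28  # 784 pixels
print(full_gaussian_params(D))        # 308504
print(factorized_gaussian_params(D))  # 1568
```

Latent variable models sit between these extremes: the pixels become (conditionally) independent given a low-dimensional latent code.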
Generative modeling with Latent Variables

Generative process: first sample a latent code z ∼ p(z), then sample x ∼ p(x|z).

Log of the marginal distribution: log p(x) = log ∫ p(x|z) p(z) dz.

How to train such a model efficiently?
Variational inference for Latent Variable Models

Introduce a variational posterior q(z|x). Then, by Jensen’s inequality:

log p(x) = log ∫ q(z|x) [ p(x|z) p(z) / q(z|x) ] dz ≥ E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) || p(z) )

The first term is the reconstruction error; the second term is a regularization term.

Here, p(x|z) is the decoder, q(z|x) is the encoder, and p(z) is the prior.

Encoder + decoder + prior + the reparameterization trick = Variational Auto-Encoder.
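The regularization term KL(q(z|x) || p(z)) has a closed form when both distributions are diagonal Gaussians; a minimal sketch of this standard result (an added example, not from the slides):

```python
import math

def kl_diag_gaussian_vs_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions:
    0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) per dimension."""
    return sum(
        0.5 * (s * s + m * m - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )

print(kl_diag_gaussian_vs_std_normal([0.0], [1.0]))  # 0.0 (q equals the prior)
print(kl_diag_gaussian_vs_std_normal([1.0], [1.0]))  # 0.5
```

Because this term is analytic, only the reconstruction term needs to be estimated by sampling.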
Variational Auto-Encoders

encoder net → code → decoder net

● VAE copies input to output through a bottleneck.
● VAE learns a code of the data.
● VAE puts a prior on the latent code.
● VAE can generate new data.

The encoder net outputs the parameters μ and σ of the variational posterior over the code; the prior over the code is centered at 0. To generate new data, sample a code from the prior and pass it through the decoder net.
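The reparameterization trick mentioned above rewrites sampling z ∼ N(μ, σ²) as a deterministic function of (μ, σ) plus independent noise, so gradients can flow through μ and σ; a minimal sketch (an added illustration, not the slides’ code):

```python
import random

def reparameterize(mu, sigma, eps=None):
    """z = mu + sigma * eps, with eps ~ N(0, 1).
    mu and sigma come from the encoder; eps carries all the randomness,
    so z is a differentiable function of mu and sigma."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# With eps fixed, the sample is deterministic:
print(reparameterize([0.0, 1.0], [1.0, 0.5], eps=[0.3, -0.2]))  # ≈ [0.3, 0.9]
```

This is what allows the ELBO to be optimized with ordinary stochastic gradient descent.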
Components of VAEs

Each component of a VAE can be improved:
● Decoder: Resnets, DRAW, autoregressive models, normalizing flows.
● Encoder: autoregressive models, normalizing flows, discrete encoders, hyperspherical dist., hyperbolic-normal dist., group theory.
● Prior: VampPrior, implicit prior.
● Objective: adversarial learning, MMD, Wasserstein AE.
Variational posterior in VAEs

Question: How to minimize KL(q||p)? In other words: how to formulate a more flexible family of approximate (variational) posteriors?

Using a Gaussian is not sufficiently flexible, and we need a computationally efficient tool.

Arvanitidis, G., Hansen, L. K., & Hauberg, S. (2017). Latent space oddity: on the curvature of deep generative models. arXiv preprint arXiv:1710.11379.
Variational inference with normalizing flows

● Sample from a “simple” distribution: z_0 ∼ q_0(z_0).
● Apply a sequence of K invertible transformations: z_K = f_K ∘ … ∘ f_1(z_0),
and the change of variables yields:

ln q_K(z_K) = ln q_0(z_0) − Σ_{k=1}^{K} ln | det ∂f_k/∂z_{k−1} |

The learning objective (ELBO) with normalizing flows becomes:

E_{q_0(z_0)}[ log p(x|z_K) + log p(z_K) − log q_0(z_0) + Σ_{k=1}^{K} ln | det ∂f_k/∂z_{k−1} | ]

The difficulty lies in calculating the Jacobian determinant:
● Volume-preserving flows: | det ∂f_k/∂z_{k−1} | = 1.
● General normalizing flows: | det ∂f_k/∂z_{k−1} | is “easy” to compute.

Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. ICML 2015.
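The change-of-variables formula above can be checked numerically for the simplest possible flow, a 1-D affine map f(z) = a·z + b (an added illustration with made-up values of a and b):

```python
import math

def log_std_normal(z):
    # log density of N(0, 1)
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

# Flow: z1 = f(z0) = a * z0 + b, so df/dz0 = a and the change of
# variables gives ln q1(z1) = ln q0(z0) - ln|a|.
a, b = 2.0, 1.0
z0 = 0.7
z1 = a * z0 + b

log_q1_flow = log_std_normal(z0) - math.log(abs(a))

# Sanity check against the known closed form: z1 ~ N(b, a^2).
log_q1_closed = -0.5 * math.log(2.0 * math.pi * a * a) - (z1 - b) ** 2 / (2.0 * a * a)

print(abs(log_q1_flow - log_q1_closed) < 1e-12)  # True
```

The whole art of normalizing flows is making the ln|det| term cheap while keeping f expressive.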
Sylvester Normalizing Flows

First, let us take a look at planar flows (Rezende & Mohamed, 2015):

z' = z + u h(wᵀz + b)

This is equivalent to a residual layer with a single neuron.
Can we calculate the Jacobian determinant efficiently?
We can use the matrix determinant lemma to get the Jacobian determinant:

det( I + h'(wᵀz + b) u wᵀ ) = 1 + h'(wᵀz + b) wᵀu,

which is linear w.r.t. the dimensionality of z.

The single-neuron bottleneck requires many flow steps, so how can we improve on that?
1. Can we generalize planar flows?
2. If yes, how can we compute the Jacobian determinant efficiently?
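The matrix determinant lemma used above, det(I + u vᵀ) = 1 + vᵀu, turns a D×D determinant into a scalar; a small numerical check in 2-D (an added example with made-up values):

```python
def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# The Jacobian of a planar flow is a rank-1 update of the identity:
# J = I + c * u w^T, where c = h'(w^T z + b) is a scalar.
c = 0.5            # stand-in value for h'(w^T z + b)
u = [0.5, -1.0]    # hypothetical u
w = [2.0, 0.3]     # hypothetical w

J = [[1.0 + c * u[0] * w[0], c * u[0] * w[1]],
     [c * u[1] * w[0], 1.0 + c * u[1] * w[1]]]

lhs = det2(J)                                         # full determinant
rhs = 1.0 + c * sum(wi * ui for wi, ui in zip(w, u))  # determinant lemma
print(abs(lhs - rhs) < 1e-12)  # True
```

The right-hand side costs O(D) instead of the O(D³) of a general determinant.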
SNF: Generalizing Planar Flows

We can control the bottleneck by generalizing the vectors u and w to matrices A (D×M) and B (M×D):

z' = z + A h(Bz + b)

How to calculate the det of the Jacobian? Use the Sylvester determinant identity:

det( I_D + A diag(h'(Bz + b)) B ) = det( I_M + diag(h'(Bz + b)) B A )

OK, but it’s very expensive! Can we simplify these calculations?

Next, we can use a QR-style decomposition to represent A and B:

A = Q R,  B = R̃ Qᵀ,

where the columns of Q are orthonormal vectors and R, R̃ are triangular M×M matrices. Then QᵀQ = I, the M×M determinant involves only triangular matrices, and it reduces to a product of diagonal entries.
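The Sylvester determinant identity det(I_D + A B) = det(I_M + B A) trades a D×D determinant for an M×M one (with M ≪ D); a small check with D = 3, M = 2 and made-up matrices (an added example):

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add_eye(M):
    return [[M[i][j] + (1.0 if i == j else 0.0) for j in range(len(M))]
            for i in range(len(M))]

A = [[1.0, 0.5],       # D x M = 3 x 2
     [0.0, 2.0],
     [1.0, -1.0]]
B = [[0.5, 1.0, 0.0],  # M x D = 2 x 3
     [2.0, 0.0, 1.0]]

lhs = det3(add_eye(matmul(A, B)))  # 3x3 determinant
rhs = det2(add_eye(matmul(B, A)))  # 2x2 determinant
print(lhs, rhs)  # equal (both -5.25 here)
```

For SNF, the M×M side is the cheap one, especially once A and B are represented via Q and triangular R’s.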
SNF: Invertible transformations

But is the proposed flow invertible in general? NO.

Theorem
If h is smooth with a bounded, strictly positive derivative, R̃ is invertible, and the diagonal entries of R and R̃ satisfy r_ii r̃_ii > −1 / ‖h'‖_∞, then

z' = z + Q R h( R̃ Qᵀ z + b )

is invertible.

Hence:
1. For Q and triangular R’s, computing the Jacobian determinant is efficient.
2. Restricting the R’s results in invertible transformations.

But how to keep Q orthogonal?
SNF: Learning an orthogonal matrix
1. (O-SNF) Iterative orthogonalization procedure (e.g., Kovarik, 1970):
a. Repeat until convergence: Q ← Q (I + ½(I − QᵀQ)).
b. We can backpropagate through this procedure.
c. We can control the bottleneck by changing the number of columns of Q.
2. (H-SNF) Use l Householder transformations to represent Q.
a. Then, SNF is a non-linear extension of the Householder flow.
b. No bottleneck!
3. (T-SNF) Alternate between the identity matrix and a fixed permutation matrix.
a. This ensures that all elements of z are processed equally on average.
b. Also used in RealNVP and IAF.
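The iterative procedure in point 1 can be sketched in a few lines of NumPy; a fixed number of steps stands in for "until convergence" here, and the initial rescaling (an assumption of this sketch) keeps the iteration in its convergence region:

```python
import numpy as np

def orthogonalize(Q, n_steps=30):
    """Iteratively orthogonalize the columns of Q via the Kovarik-style
    update Q <- Q (I + 0.5 (I - Q^T Q)). Every step is a differentiable
    matrix expression, so the procedure can be backpropagated through."""
    M = Q.shape[1]
    # Rescale so all singular values are <= 1 (convergence region).
    Q = Q / np.linalg.norm(Q, ord=2)
    I = np.eye(M)
    for _ in range(n_steps):
        Q = Q @ (I + 0.5 * (I - Q.T @ Q))
    return Q
```

Each update pushes every singular value of Q toward 1 (quadratically near convergence), so Q ends up with orthonormal columns; using fewer columns than rows gives the O-SNF bottleneck.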
Sylvester Normalizing Flows
● A single step: z′ = z + Q R̃ h(R Qᵀ z + b).
● Keep Q orthogonal:
○ With bottleneck: O-SNF.
○ No bottleneck: H-SNF, T-SNF.
● In order to increase flexibility, we can use hypernets to compute Q, R̃, R, and b from the data.
SNF: Results on MNIST
SNF: Results on other data
No. of flows: 16
IAF: 1280 wide MADE, no hypernets
Bottleneck in O-SNF: 32
No. of Householder transformations in H-SNF: 8
Components of VAEs (overview diagram):
● Normalizing flows, discrete encoders, hyperspherical dist., hyperbolic-normal dist., group theory
● Resnets, DRAW, autoregressive models, normalizing flows
● Autoregressive models, normalizing flows, VampPrior, implicit prior
● Adversarial learning, MMD, Wasserstein AE
Geometric perspective on VAEs
Question: Is it possible to recover the true Riemannian structure of the latent space? In other words: will geodesics follow the data manifold?
For Gaussian VAE: No.
We need a better notion of uncertainty or different models.
Arvanitidis, G., Hansen, L. K., & Hauberg, S. (2017). Latent space oddity: on the curvature of deep generative models. arXiv preprint arXiv:1710.11379.
Potential problems with Gaussians
In VAEs it is very often assumed that the posterior and the prior are Gaussian. But:
● The Gaussian prior is concentrated around the origin ⟶ possible bias.
● In high dimensions, the Gaussian concentrates on a hypersphere ⟶ the ℓ2 norm fails.
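The concentration effect is easy to verify numerically: the norm of a standard Gaussian sample concentrates around √d with a standard deviation of roughly 1/√2, independent of d, so almost all mass sits on a thin shell far from the origin (the mode). A quick sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    x = rng.normal(size=(1000, d))         # 1000 samples from N(0, I_d)
    norms = np.linalg.norm(x, axis=1)
    # mean norm / sqrt(d) approaches 1; the spread around it stays O(1),
    # so relatively speaking the shell gets thinner as d grows
    print(d, norms.mean() / np.sqrt(d), norms.std())
```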
Using a hyperspherical latent space
Since in high dimensions the Gaussian distribution concentrates on a hypersphere, we propose to use a distribution defined on the hypersphere, the von Mises–Fisher distribution:
vMF(z; μ, κ) = C_m(κ) exp(κ μᵀ z),  C_m(κ) = κ^{m/2−1} / ((2π)^{m/2} I_{m/2−1}(κ)),
where I_v is the modified Bessel function of the first kind of order v.
Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., & Tomczak, J. M. (2018). Hyperspherical Variational Auto-Encoders. UAI 2018
Hyperspherical VAE
● We define the latent space to be the hypersphere S^{m−1}.
● The variational distribution is the von Mises–Fisher distribution, and the prior is uniform on the sphere, i.e., von Mises–Fisher with κ = 0; the resulting KL term has a closed form in terms of the Bessel functions above.
● There exists an efficient sampling procedure using a Householder transformation (Ulrich, 1984).
● The reparameterization trick can be achieved by using rejection sampling (Naesseth et al., 2017).
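The Ulrich (1984) sampling procedure mentioned above can be sketched as follows: rejection-sample the component of z along μ, draw the orthogonal part uniformly, then Householder-reflect e1 onto μ. This is an illustrative NumPy sketch, not the authors' code:

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    """Draw n samples from vMF(mu, kappa) on S^{m-1} (Ulrich, 1984)."""
    m = mu.shape[0]
    # 1) Rejection-sample w = cos(angle to mu).
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + (m - 1) ** 2)) / (m - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (m - 1) * np.log(1 - x0**2)
    ws = []
    while len(ws) < n:
        zb = rng.beta((m - 1) / 2, (m - 1) / 2)
        w = (1 - (1 + b) * zb) / (1 - (1 - b) * zb)
        if kappa * w + (m - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
            ws.append(w)
    w = np.array(ws)[:, None]
    # 2) Uniform direction in the (m-1)-dim subspace orthogonal to e1.
    v = rng.normal(size=(n, m - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    x = np.concatenate([w, np.sqrt(1 - w**2) * v], axis=1)  # vMF around e1
    # 3) Householder reflection mapping e1 to mu.
    e1 = np.zeros(m); e1[0] = 1.0
    u = e1 - mu
    if np.linalg.norm(u) > 1e-12:
        u /= np.linalg.norm(u)
        x = x - 2 * (x @ u)[:, None] * u
    return x
```

Because the rejection step is what makes the draw non-differentiable, the reparameterization trick of Naesseth et al. (2017) is needed to backpropagate through it.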
Hyperspherical VAE: Results on MNIST
Hyperspherical VAE: Results on semi-supervised MNIST
Hyperspherical GraphVAE: Link prediction
Problems of holes in VAEs
● There is a discrepancy between the posteriors and the Gaussian prior that results in regions that were never "seen" by the posterior (holes) ⇾ multi-modal prior.
● The sampling process could produce unrealistic samples.
Rezende, D.J. and Viola, F., 2018. Taming VAEs. arXiv preprint arXiv:1810.00597.
(Figure: latent space under the standard prior vs. a multi-modal prior.)
Looking for the optimal prior
● Let's rewrite the ELBO over the training data:
(1/N) Σ_n ELBO(x_n) = (1/N) Σ_n E_{q(z|x_n)}[log p(x_n|z)] + (1/N) Σ_n H[q(z|x_n)] − CE[q̄(z), p(z)],
where q̄(z) = (1/N) Σ_n q(z|x_n) is the aggregated posterior and CE is the cross-entropy.
● KL = 0 iff p(z) = q̄(z), so the optimal prior = aggregated posterior.
● Summing over all training data is infeasible, and since the sample is finite, it could cause some additional instabilities. Instead we propose to use the VampPrior:
p(z) = (1/K) Σ_k q(z|u_k),
where the pseudoinputs u_k are trained from scratch by SGD.
Tomczak, J.M., Welling, M. (2018), VAE with a VampPrior, AISTATS 2018
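Evaluating the VampPrior is just evaluating a mixture of the encoder's posteriors at the learned pseudoinputs. A minimal sketch, assuming a diagonal-Gaussian encoder that maps a pseudoinput to a `(mean, log_var)` pair (the `encoder` signature here is hypothetical, for illustration):

```python
import numpy as np
from scipy.special import logsumexp

def vampprior_logp(z, pseudoinputs, encoder):
    """log p(z) = log (1/K) sum_k q(z | u_k) for the VampPrior.

    `encoder(u)` returns (mean, log_var) of a diagonal Gaussian q(z|u);
    the pseudoinputs u_k are free parameters trained by SGD."""
    log_components = []
    for u in pseudoinputs:
        mean, log_var = encoder(u)
        # diagonal-Gaussian log-density log N(z; mean, exp(log_var))
        log_q = -0.5 * np.sum(
            log_var + np.log(2 * np.pi) + (z - mean) ** 2 / np.exp(log_var)
        )
        log_components.append(log_q)
    # log-sum-exp over components for numerical stability
    return logsumexp(log_components) - np.log(len(pseudoinputs))
```

With K = 1 and a pseudoinput equal to the encoder's mean, this reduces to a single Gaussian log-density, which makes for an easy sanity check.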
VampPrior: Experiments (pseudoinputs)
VampPrior: Experiments (samples)
VampPrior: Experiments (reconstructions)
Flow-based models
The change of variables formula
● Let's recall the change of variables formula for invertible transformations: p(x) = π(z = f(x)) |det ∂f(x)/∂x|.
● We can think of it as an invertible neural network mapping between pixel space and latent space.
Rippel, O., & Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.
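A 1-D sketch makes the formula concrete: pick an invertible f, a base density π on the latent side, and check that the induced density on x still integrates to 1. Here f = tanh and π = uniform on (−1, 1) are arbitrary illustrative choices:

```python
import numpy as np

def p_x(x):
    """Density induced on x by z = tanh(x) with a uniform base pi(z) = 1/2
    on (-1, 1): p(x) = pi(f(x)) * |df/dx|."""
    jac = 1.0 - np.tanh(x) ** 2      # |df/dx| for f = tanh
    return 0.5 * jac                 # pi(z) * |df/dx|

# The Jacobian factor is exactly what keeps total probability equal to 1.
xs = np.linspace(-10, 10, 100_001)
mass = np.sum(p_x(xs)) * (xs[1] - xs[0])   # Riemann sum over the real line
print(mass)
```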
RealNVP
● Design the invertible transformations as follows:
y_{1:d} = x_{1:d},  y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}).
● Invertible by design:
x_{1:d} = y_{1:d},  x_{d+1:D} = (y_{d+1:D} − t(y_{1:d})) ⊙ exp(−s(y_{1:d})).
● Easy Jacobian:
log |det ∂y/∂x| = Σ_j s(x_{1:d})_j.
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
Results
GLOW
● Adds a trainable 1x1 convolution followed by an affine coupling layer.
● Adds actnorm (a per-channel affine normalization layer).
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NIPS.
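The invertible 1x1 convolution is just one shared C×C matrix applied to the channel vector at every spatial position, so its log-determinant is H·W·log|det W|. A minimal sketch on an (H, W, C) array (illustrative, not the paper's code):

```python
import numpy as np

def inv1x1_forward(x, W):
    """GLOW-style invertible 1x1 convolution: multiply every pixel's
    channel vector by the same C x C matrix W. The map's
    log|det Jacobian| is H * W * log|det W|."""
    h, w, c = x.shape
    y = x @ W.T                                 # acts on the channel axis
    log_det = h * w * np.linalg.slogdet(W)[1]   # log|det W| per pixel
    return y, log_det

def inv1x1_inverse(y, W):
    """Invert by applying W^{-1} channel-wise."""
    return y @ np.linalg.inv(W).T
```

In practice Glow parameterizes W via an LU decomposition so the determinant stays cheap to compute during training.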
Results
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NIPS
Future directions
Blurriness and sampling in VAEs
● How to avoid sampling from holes?
● Should we follow geodesics in the latent space?
● How to use the geometry of the latent space to build better decoders?
● How to build temporal decoders? Can we do better than Conv3D?
Compression and VAEs
● Taking a deterministic encoder allows us to simplify the objective.
● It is important to learn a powerful prior. This is challenging!
● Is it easier to learn a prior with temporal dependencies?
● Can we alleviate some dependencies by using hypernets?
Active learning/RL and VAEs
● Using the latent representation to navigate and/or quantify uncertainty.
● Formulating policies entirely in the latent space.
● Do we need a better notion of sequential dependencies?
Hybrid and flow-based models
● We need a better understanding of the latent space.
● Joining an invertible model (a flow-based model) with a predictive model.
● Isn't this model overkill?
● How would it work in the multi-modal learning scenario?
Hybrid models and OOD samples
● Going back to the first slides, we need a good notion of p(x).
● Distinguishing out-of-distribution (OOD) samples is very important.
● Crucial for decision making, outlier detection, policy learning…
Thank you!
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
©2019 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.