
Learning representations for efficient inference

Matt Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan Adams


Two strategies for the local variational factor q(x):

Natural gradient SVI for nice PGMs

$$q^*(x) \triangleq \operatorname*{arg\,max}_{q(x)} \mathcal{L}[\, q(\theta)\, q(x)\, ]$$

Variational autoencoders and inference networks

$$q^*(x) \triangleq \mathcal{N}\!\left(x \mid \mu(y;\phi),\ \Sigma(y;\phi)\right)$$

Natural gradient SVI for nice PGMs [1,2]

[1] Hoffman, Bach, Blei. Online learning for latent Dirichlet allocation. NIPS 2010.
[2] Hoffman, Blei, Wang, Paisley. Stochastic variational inference. JMLR 2013.

Model assumptions:

p(x | θ) is a linear dynamical system
p(y | x, θ) is a linear-Gaussian observation
p(θ) is a conjugate prior

[Graphical model: global parameters θ, latent chain x_1 → x_2 → x_3 → x_4 with emissions y_1, …, y_4, in a plate over N sequences]

Fit a mean-field approximation

$$q(\theta)\, q(x) \approx p(\theta, x \mid y)$$

by maximizing the evidence lower bound

$$\mathcal{L}(\eta_\theta, \eta_x) \triangleq \mathbb{E}_{q(\theta) q(x)}\!\left[ \log \frac{p(\theta, x, y)}{q(\theta)\, q(x)} \right].$$

Partially optimizing over the local factor defines

$$\eta_x^*(\eta_\theta) \triangleq \operatorname*{arg\,max}_{\eta_x} \mathcal{L}(\eta_\theta, \eta_x), \qquad \mathcal{L}_{\mathrm{SVI}}(\eta_\theta) \triangleq \mathcal{L}(\eta_\theta, \eta_x^*(\eta_\theta)).$$

Proposition (natural gradient SVI of Hoffman et al. 2013):

$$\widetilde{\nabla} \mathcal{L}_{\mathrm{SVI}}(\eta_\theta) = \eta_\theta^0 + \sum_{n=1}^{N} \mathbb{E}_{q^*(x_n)}\!\left[ (t_{xy}(x_n, y_n), 1) \right] - \eta_\theta,$$

where η_θ^0 is the natural parameter of the prior p(θ) and t_xy are the sufficient statistics of the likelihood.
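The proposition follows from a standard conjugate exponential-family computation; the slides state the result without proof, so here is a brief sketch. Writing A for the log-partition function of q(θ) and s for the summed expected statistics, the partially optimized bound reduces to

$$\mathcal{L}_{\mathrm{SVI}}(\eta_\theta) = \left\langle \eta_\theta^0 + s - \eta_\theta,\ \nabla A(\eta_\theta) \right\rangle + A(\eta_\theta) + \mathrm{const}, \qquad s \triangleq \sum_{n=1}^{N} \mathbb{E}_{q^*(x_n)}[(t_{xy}(x_n, y_n), 1)],$$

so the flat gradient is ∇²A(η_θ)(η_θ^0 + s − η_θ), and premultiplying by the inverse Fisher information ∇²A(η_θ)^{-1} leaves exactly the natural gradient η_θ^0 + s − η_θ stated above.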

Resulting algorithm [1,2]:

Step 1: compute evidence potentials
Step 2: run fast message passing
Step 3: compute natural gradient

+ arbitrary inference queries

A code sketch of one such update follows the references below.

[1] Johnson, Willsky. Stochastic variational inference for Bayesian time series models. ICML 2014.
[2] Foti, Xu, Laird, Fox. Stochastic variational inference for hidden Markov models. NIPS 2014.
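To make the three steps concrete, here is a minimal sketch of one natural-gradient SVI update for the LDS model. It assumes hypothetical helpers evidence_potentials (Step 1) and kalman_smoother (Step 2) that return per-node potentials and the expected sufficient statistics of the optimal q*(x); neither name comes from a particular library.

```python
import numpy as np

def svi_update(eta_theta, eta_prior, y_batch, scale, step_size):
    """One natural-gradient SVI step for the LDS (hedged sketch).

    eta_theta : natural parameters of q(theta), flattened to an array
    eta_prior : natural parameters of the conjugate prior p(theta)
    scale     : N / |minibatch|, unbiasing the minibatch statistics
    """
    expected_stats = np.zeros_like(eta_theta)
    for y in y_batch:
        # Step 1: evidence potentials from the linear-Gaussian likelihood
        node_potentials = evidence_potentials(eta_theta, y)
        # Step 2: Kalman smoothing computes the optimal local factor
        # q*(x) and its expected statistics E[(t_xy(x, y), 1)] in O(T)
        expected_stats += kalman_smoother(eta_theta, node_potentials)
    # Step 3: the natural gradient has the simple conjugate form
    nat_grad = eta_prior + scale * expected_stats - eta_theta
    return eta_theta + step_size * nat_grad
```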

What about more general observation models?

p(x | θ) is a linear dynamical system
p(y | x, γ) is a neural network decoder
p(θ) is a conjugate prior, p(γ) is generic

[Graphical model: θ, latent chain x_1 → … → x_4, emissions y_1, …, y_4 generated through the decoder parameters γ]

$$q(\theta)\, q(\gamma)\, q(x) \approx p(\theta, \gamma, x \mid y)$$

$$\mathcal{L}(\eta_\theta, \eta_\gamma, \eta_x) \triangleq \mathbb{E}_{q(\theta) q(\gamma) q(x)}\!\left[ \log \frac{p(\theta, \gamma, x)\, p(y \mid x, \gamma)}{q(\theta)\, q(\gamma)\, q(x)} \right]$$

$$\eta_x^*(\eta_\theta, \eta_\gamma) \triangleq \operatorname*{arg\,max}_{\eta_x} \mathcal{L}(\eta_\theta, \eta_\gamma, \eta_x), \qquad \mathcal{L}_{\mathrm{SVI}}(\eta_\theta, \eta_\gamma) \triangleq \mathcal{L}(\eta_\theta, \eta_\gamma, \eta_x^*(\eta_\theta, \eta_\gamma))$$

With a neural-network likelihood the inner optimization over q(x) no longer has a conjugate closed form, so computing η_x* exactly is expensive.
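Here "neural network decoder" just means the observation density's parameters are produced by a network applied to the latent state. The sketch below assumes a Gaussian likelihood with an MLP mean and a learned diagonal variance; the layer sizes and parameter packing are illustrative, not taken from the talk.

```python
import numpy as np

def decoder(x_t, gamma):
    """Sketch of p(y_t | x_t, gamma): a Gaussian whose mean is an MLP
    of the latent state. All parameter names are illustrative."""
    W1, b1, W2, b2, log_var = gamma
    h = np.tanh(W1 @ x_t + b1)       # hidden layer
    mean = W2 @ h + b2               # observation mean mu(x_t; gamma)
    return mean, np.exp(log_var)     # y_t ~ N(mean, diag(var))
```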

Variational autoencoders and inference networks [1,2]

[1] Kingma, Welling. Auto-encoding variational Bayes. ICLR 2014.
[2] Rezende, Mohamed, Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML 2014.

Instead of optimizing each local factor, learn a recognition network that maps each observation directly to its variational parameters,

$$q^*(x_n) \triangleq \mathcal{N}\!\left(x_n \mid \mu(y_n; \phi),\ \Sigma(y_n; \phi)\right),$$

with Σ(y_n; φ) typically diagonal, and train by maximizing

$$\mathcal{L}_{\mathrm{VAE}}(\eta_\gamma, \phi) \triangleq \mathcal{L}(\eta_\gamma, \eta_x^*(\phi)).$$

For state-space models, structured inference networks instead parameterize a correlated Gaussian over the whole trajectory: a block-tridiagonal precision built from potentials μ_t(y_t; φ_μ), J_{t,t}(y_t; φ_D), J_{t,t+1}(y_t, y_{t+1}; φ_B) [1,2], or a recurrent parameterization μ_t(y_{1:T}, x̂_{t−1}; φ), Σ_t(y_{1:T}, x̂_{t−1}; φ) [3].

[1] Archer, Park, Buesing, Cunningham, Paninski. Black box variational inference for state space models. ICLR 2016 Workshops.
[2] Gao*, Archer*, Paninski, Cunningham. Linear dynamical neural population models through nonlinear embeddings. NIPS 2016.
[3] Krishnan, Shalit, Sontag. Structured inference networks for nonlinear state space models. AISTATS 2017.
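As a concrete instance of the per-observation case, here is a minimal diagonal-Gaussian recognition network. The single hidden layer and the tuple packing of phi are assumptions for illustration, not the architecture of any cited paper.

```python
import numpy as np

def recognition_net(y_n, phi):
    """Sketch of q*(x_n) = N(x_n | mu(y_n; phi), diag(sigma2(y_n; phi)))."""
    W1, b1, W_mu, b_mu, W_s, b_s = phi
    h = np.tanh(W1 @ y_n + b1)        # hidden layer
    mu = W_mu @ h + b_mu              # mean mu(y_n; phi)
    sigma2 = np.exp(W_s @ h + b_s)    # positive diagonal variances
    return mu, sigma2
```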

Comparing the two strategies:

Natural gradient SVI

$$q^*(x) \triangleq \operatorname*{arg\,max}_{q(x)} \mathcal{L}[\, q(\theta)\, q(x)\, ]$$

– expensive for general observations
+ optimal local factor
+ exploits conjugate graph structure
+ arbitrary inference queries
+ natural gradients

Variational autoencoders

$$q^*(x) \triangleq \mathcal{N}\!\left(x \mid \mu(y; \phi),\ \Sigma(y; \phi)\right)$$

+ fast for general observations
– suboptimal local factor
– does all local inference
– limited inference queries
– no natural gradients

Structured VAEs [1] aim to combine the strengths of both columns:

$$q^*(x) \triangleq \ ?$$

+ fast for general observations
± optimal given conjugate evidence
+ exploits conjugate graph structure
+ arbitrary inference queries
+ natural gradients on η_θ

SVAEs: recognition networks output conjugate potentials, then apply fast graphical model algorithms.

[1] Johnson, Duvenaud, Wiltschko, Datta, Adams. Composing graphical models and neural networks. NIPS 2016.

Start from the ELBO

$$\mathcal{L}(\eta_\theta, \eta_\gamma, \eta_x) \triangleq \mathbb{E}_{q(\theta) q(\gamma) q(x)}\!\left[ \log \frac{p(\theta, \gamma, x)\, p(y \mid x, \gamma)}{q(\theta)\, q(\gamma)\, q(x)} \right]$$

and replace each non-conjugate likelihood term $\mathbb{E}_{q(\gamma)} \log p(y_t \mid x_t, \gamma)$ with a recognition potential $\psi(x_t; y_t, \phi)$, where $\psi(x; y, \phi)$ is a conjugate potential for $p(x \mid \theta)$. This gives the surrogate objective

$$\hat{\mathcal{L}}(\eta_\theta, \eta_x, \phi) \triangleq \mathbb{E}_{q(\theta) q(\gamma) q(x)}\!\left[ \log \frac{p(\theta, \gamma, x)\, \exp\{\psi(x; y, \phi)\}}{q(\theta)\, q(\gamma)\, q(x)} \right]$$

and the SVAE objective

$$\eta_x^*(\eta_\theta, \phi) \triangleq \operatorname*{arg\,max}_{\eta_x} \hat{\mathcal{L}}(\eta_\theta, \eta_x, \phi), \qquad \mathcal{L}_{\mathrm{SVAE}}(\eta_\theta, \eta_\gamma, \phi) \triangleq \mathcal{L}(\eta_\theta, \eta_\gamma, \eta_x^*(\eta_\theta, \phi)).$$

Because ψ is conjugate to the latent LDS, η_x* is computable with the same fast message-passing routines as before.
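To see what a conjugate potential looks like for a Gaussian latent chain: a Gaussian factor in x_t composes with the LDS prior, so the recognition output can simply be re-expressed in natural (information-form) parameters. This sketch reuses the hypothetical recognition_net above; the conversion is standard for Gaussians, but the function names are invented.

```python
def conjugate_potential(y_t, phi):
    """Sketch: recognition output as a conjugate Gaussian potential
    psi(x_t; y_t, phi) = h_t . x_t - 0.5 * x_t . (J_t * x_t)."""
    mu, sigma2 = recognition_net(y_t, phi)  # hypothetical helper above
    J_t = 1.0 / sigma2                      # diagonal precision
    h_t = mu / sigma2                       # precision-weighted mean
    return J_t, h_t
```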

The SVAE algorithm:

Step 1: apply recognition network
Step 2: run fast PGM algorithms
Step 3: sample, compute flat gradients
Step 4: compute natural gradient

A sketch of these four steps in code follows below.
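Here is one way the four steps could fit together, under the same assumptions as the earlier sketches. message_passing, sample_gaussian, and decoder_loglik_grads are hypothetical stand-ins; in practice the flat gradients in Step 3 come from automatic differentiation (the reference code at github.com/mattjj/svae builds on autograd).

```python
def svae_gradients(eta_theta, eta_prior, gamma, phi, y_seq, scale):
    """One SVAE gradient computation (all helpers are hypothetical)."""
    # Step 1: recognition network emits conjugate evidence potentials
    potentials = [conjugate_potential(y_t, phi) for y_t in y_seq]
    # Step 2: fast PGM message passing (Kalman smoothing here) yields
    # the optimal surrogate factor q*(x) and its expected statistics
    q_x, expected_stats = message_passing(eta_theta, potentials)
    # Step 3: sample x ~ q*(x) by reparameterization; take flat
    # (ordinary) gradients of the decoder term wrt gamma and phi
    x = sample_gaussian(q_x)
    grad_gamma, grad_phi = decoder_loglik_grads(gamma, phi, x, y_seq)
    # Step 4: natural gradient for the conjugate global factor q(theta)
    nat_grad_theta = eta_prior + scale * expected_stats - eta_theta
    return nat_grad_theta, grad_gamma, grad_phi
```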

[Figure: learned correspondence between data space and latent space]
[Figure: data, latent states, and predictions plotted against frame index]

The conjugate-potential construction applies across the PGM zoo:

Gaussian mixture model [1], linear dynamical system [2], hidden Markov model [3], switching LDS [4]
Mixture of experts [5], driven LDS [2], IO-HMM [6], factorial HMM [7]
Canonical correlation analysis [8,9], admixture / LDA / NMF [10]

[1] Corduneanu, Bishop. Variational Bayesian model selection for mixture distributions. AISTATS 2001.
[2] Ghahramani, Beal. Propagation algorithms for variational Bayesian learning. NIPS 2001.
[3] Beal. Variational algorithms for approximate Bayesian inference, Ch. 3. University of London Ph.D. thesis, 2003.
[4] Ghahramani, Hinton. Variational learning for switching state-space models. Neural Computation 2000.
[5] Jordan, Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 1994.
[6] Bengio, Frasconi. An input output HMM architecture. NIPS 1995.
[7] Ghahramani, Jordan. Factorial hidden Markov models. Machine Learning 1997.
[8] Bach, Jordan. A probabilistic interpretation of canonical correlation analysis. Tech. report, 2005.
[9] Archambeau, Bach. Sparse probabilistic projections. NIPS 2008.
[10] Hoffman, Bach, Blei. Online learning for latent Dirichlet allocation. NIPS 2010.

SVAEs support arbitrary inference queries, and can use any inference network architecture [1,2].

[1] Archer, Park, Buesing, Cunningham, Paninski. Black box variational inference for state space models. ICLR 2016 Workshops.
[2] Gao*, Archer*, Paninski, Cunningham. Linear dynamical neural population models through nonlinear embeddings. NIPS 2016.

[Closing diagram relating SVAEs to supervised and unsupervised learning]

Code: github.com/mattjj/svae
