When a neural net meets
a Gaussian process
Guofei Pang
05/11/18 Crunch seminar
Neural net (fully connected)
[Figure: fully connected network with layers labeled Input layer, First hidden layer, Last hidden layer, Output layer; nodes x_i^l, y_i^l are connected by weights w_{i,j}^{l+1,l} and biases b_i^{l+1}]
y_i^0 = x_i^0,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^{L+1} = \sum_{j=1}^{N_L} w_{i,j}^{L+1,L} y_j^L + b_i^{L+1}
y_i^{L+1} = \phi(x_i^{L+1}),  i = 1, 2, ..., N_{L+1}
2/18
Neural net (fully connected)
[Figure: the same fully connected network diagram as above, with layers labeled Input layer, First hidden layer, Last hidden layer, Output layer]
x_i^{l+1} = \sum_{j=1}^{N_l} w_{i,j}^{l+1,l} y_j^l + b_i^{l+1}
y_i^{l+1} = \phi(x_i^{l+1}),  i = 1, 2, ..., N_{l+1},  l = 0, 1, ..., L
y_i^0 = x_i^0
y_1^{L+1} = f_1(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
y_2^{L+1} = f_2(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
...
y_{N_{L+1}}^{L+1} = f_{N_{L+1}}(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
Neural net (fully connected)
[Figure: the same network in vector notation: x^0 → y^0 → x^1 → y^1 → ... → x^L → y^L → x^{L+1} → y^{L+1}, with weights W^{1,0}, ..., W^{L+1,L} and biases b^1, ..., b^{L+1}]
x^{l+1} = W^{l+1,l} y^l + b^{l+1},  y^{l+1} = \phi(x^{l+1}),  l = 0, 1, ..., L
y^0 = x^0
y = y^{L+1} = f_NN(x = x^0; W = \{W^{l+1,l}\}, b = \{b^{l+1}\})
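To make the vectorized recursion concrete, here is a minimal NumPy sketch of y = f_NN(x; W, b). The layer widths, the tanh nonlinearity, and the random parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def f_nn(x, Ws, bs, phi=np.tanh):
    """Forward pass y = f_NN(x; W, b): x^{l+1} = W^{l+1,l} y^l + b^{l+1}, y^{l+1} = phi(x^{l+1})."""
    y = x                        # y^0 = x^0
    for W, b in zip(Ws, bs):     # l = 0, 1, ..., L
        y = phi(W @ y + b)
    return y                     # y^{L+1}

# Hypothetical example: N0 = 3 inputs, two hidden layers of width 50, one output
rng = np.random.default_rng(0)
widths = [3, 50, 50, 1]
Ws = [rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
      for n_in, n_out in zip(widths[:-1], widths[1:])]
bs = [rng.normal(size=n_out) for n_out in widths[1:]]
print(f_nn(np.array([0.1, -0.2, 0.3]), Ws, bs))
```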
4/18
Gaussian process
Multivariate normal distribution
p(Y) = |2\pi K|^{-1/2} \exp\{ -\tfrac{1}{2} (Y - m)^T K^{-1} (Y - m) \}
Y = [Y_1, Y_2, ..., Y_N]^T
m = E(Y) = [E(Y_1), E(Y_2), ..., E(Y_N)]^T
K = [K_{ij}] = [cov(Y_i, Y_j)]
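As a quick illustration of the density formula, here is a small NumPy sketch that evaluates p(Y); the particular m and K below are made-up values for the example.

```python
import numpy as np

def mvn_pdf(Y, m, K):
    """p(Y) = |2*pi*K|^{-1/2} exp{-(1/2) (Y - m)^T K^{-1} (Y - m)}."""
    d = Y - m
    quad = d @ np.linalg.solve(K, d)          # (Y - m)^T K^{-1} (Y - m)
    norm = np.sqrt(np.linalg.det(2 * np.pi * K))
    return np.exp(-0.5 * quad) / norm

m = np.array([0.0, 1.0])
K = np.array([[1.0, 0.3],
              [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, 0.5]), m, K))
```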
The index set of the random vector Y is discrete: {1, 2, ..., N}.
What if the index set is continuous, say the unit interval [0, 1]?
5/18
Discrete index i \in {1, 2, ..., 10}: zero mean m = 0, covariance matrix K with K_{ij} = cov(Y_i, Y_j) = \exp\{-10 (i - j)^2\}
Continuous index x \in [0, 1]: zero mean function m(x) = 0, covariance function k(x, x') = cov(Y_x, Y_{x'}) = \exp\{-10 (x - x')^2\}
Y ~ N(m, K)   vs.   Y_x := f_GP(x) ~ GP(m(x), k(x, x'))
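A sketch of the two constructions side by side with the covariance \exp\{-10 (x - x')^2\}: the discrete case draws the 10-dimensional Gaussian vector, the continuous case draws GP sample paths on a grid of [0, 1]. The grid size and the small jitter added for numerical stability are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(x, xp):
    return np.exp(-10.0 * (x - xp) ** 2)

# Discrete index i in {1, ..., 10}: Y ~ N(0, K) with K_ij = exp{-10 (i - j)^2}
i = np.arange(1, 11, dtype=float)
K_disc = k(i[:, None], i[None, :])
Y = rng.multivariate_normal(np.zeros(10), K_disc)

# Continuous index x in [0, 1]: evaluate the GP on a grid and draw sample paths
x = np.linspace(0.0, 1.0, 100)
K_cont = k(x[:, None], x[None, :]) + 1e-6 * np.eye(x.size)  # jitter for stability
f_gp = rng.multivariate_normal(np.zeros(x.size), K_cont, size=3)  # 3 sample paths

print(Y.shape, f_gp.shape)
```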
6/18
Neural net (fully connected) & Gaussian process
y = f_NN(x; W, b): a deterministic function f_NN
y = f_GP(x; ·, ·) ~ GP(m(x), k(x, x')): a stochastic function f_GP
7/18
Neural net (fully connected) & Gaussian process
y = f_GP(x; ·, ·) ~ GP(m(x), k(x, x'))
y = f_NN(x; W, b) → GP(0, k_{\sigma_w^2, \sigma_b^2}(x, x'))  if
W^{l+1,l} ~ D(0, (\sigma_w^2 / N_l) I)  (i.i.d.)
b^{l+1} ~ D(0, \sigma_b^2 I)  (i.i.d.)
N_l → ∞ for hidden layers
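A Monte Carlo sanity check of this limit (a sketch, not the original derivation): taking the distribution D to be Gaussian, draw many single-hidden-layer networks with a wide hidden layer and inspect the output at one fixed input; its empirical distribution should be approximately Gaussian with zero mean. The tanh nonlinearity, the widths, and the σ values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_w2, sigma_b2 = 1.0, 0.1
N0, N1, draws = 2, 2000, 5000          # wide hidden layer, many weight draws
x = np.array([0.3, -0.7])

outs = np.empty(draws)
for s in range(draws):
    W1 = rng.normal(0, np.sqrt(sigma_w2 / N0), size=(N1, N0))
    b1 = rng.normal(0, np.sqrt(sigma_b2), size=N1)
    W2 = rng.normal(0, np.sqrt(sigma_w2 / N1), size=(1, N1))
    b2 = rng.normal(0, np.sqrt(sigma_b2), size=1)
    y1 = np.tanh(W1 @ x + b1)
    outs[s] = (W2 @ y1 + b2)[0]        # pre-activation output x^2(x)

print("empirical mean (should be close to 0):", outs.mean())
print("empirical variance:", outs.var())
```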
8/18
[Figure: a single-hidden-layer network evaluated at input x: x → y^0 → x^1 → y^1 → x^2 → y^2, with weights and biases W^{1,0}, b^1 and W^{2,1}, b^2]
y_i^0 = x_i,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1} y_j^1 + b_i^2
y_i^2 = x_i^2,  i = 1, 2, ..., N_2
[Figure: the same network evaluated at input x']
y_i^0 = x'_i,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1} y_j^1 + b_i^2
y_i^2 = x_i^2,  i = 1, 2, ..., N_2
k^0(x, x') = Cov(x_i^1(x), x_i^1(x')) = \sigma_b^2 + (\sigma_w^2 / N_0) x^T x'   (regardless of the subscript i)
k^1(x, x') = Cov(x_i^2(x), x_i^2(x')) = \sigma_b^2 + \sigma_w^2 F(k^0(x, x'), k^0(x, x), k^0(x', x'))
x_i^1(x) ~ GP(0, k^0(x, x')),   x_i^2(x) ~ GP(0, k^1(x, x'))
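A numerical check of the single-hidden-layer formulas (a sketch under assumed values): k^0 is evaluated in closed form, and the expectation F defining k^1 is estimated by Monte Carlo over (x_i^1(x), x_i^1(x')) ~ N(0, K^0), here with a tanh nonlinearity and made-up σ values.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_w2, sigma_b2 = 1.0, 0.1
x, xp = np.array([0.3, -0.7]), np.array([-0.5, 0.2])
N0 = x.size
phi = np.tanh

def k0(a, b):
    return sigma_b2 + sigma_w2 / N0 * a @ b

# 2x2 covariance of (x_i^1(x), x_i^1(x')) -- the same for every i
K0 = np.array([[k0(x, x),  k0(x, xp)],
               [k0(xp, x), k0(xp, xp)]])

# F(k^0(x,x'), k^0(x,x), k^0(x',x')) = E[phi(u) phi(v)], (u, v) ~ N(0, K0)
uv = rng.multivariate_normal(np.zeros(2), K0, size=200_000)
F = np.mean(phi(uv[:, 0]) * phi(uv[:, 1]))

k1 = sigma_b2 + sigma_w2 * F
print("k^0(x, x') =", k0(x, xp), "  k^1(x, x') ~", k1)
```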
9/18
Neural net (fully connected) & Gaussian process
y = f_NN(x; W, b) → GP(0, k_{\sigma_w^2, \sigma_b^2}(x, x'))  if
W^{l+1,l} ~ D(0, (\sigma_w^2 / N_l) I)  (i.i.d.)
b^{l+1} ~ D(0, \sigma_b^2 I)  (i.i.d.)
N_l → ∞ for hidden layers
k_{\sigma_w^2, \sigma_b^2}(x, x') = k^L(x, x') = Cov(x_i^{L+1}(x), x_i^{L+1}(x'))
k^l(x, x') = \sigma_b^2 + \sigma_w^2 F(k^{l-1}(x, x'), k^{l-1}(x, x), k^{l-1}(x', x')),  l = 1, 2, ..., L
k^0(x, x') = Cov(x_i^1(x), x_i^1(x')) = \sigma_b^2 + (\sigma_w^2 / N_0) x^T x'
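For a general nonlinearity F has no closed form, but for ReLU it does (the arc-cosine result of Cho & Saul, 2009), so the recursion above can be iterated exactly. The sketch below assumes ReLU and arbitrary σ_w², σ_b² values; it illustrates the recursion, not the exact setup of the slides.

```python
import numpy as np

def F_relu(k_xxp, k_xx, k_xpxp):
    """E[max(0,u) max(0,v)] for (u, v) ~ N(0, [[k_xx, k_xxp], [k_xxp, k_xpxp]]) (Cho & Saul, 2009)."""
    s = np.sqrt(k_xx * k_xpxp)
    theta = np.arccos(np.clip(k_xxp / s, -1.0, 1.0))
    return s / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def nngp_kernel(x, xp, L, sigma_w2=1.6, sigma_b2=0.1):
    """Iterate k^l = sigma_b^2 + sigma_w^2 * F(k^{l-1}(x,x'), k^{l-1}(x,x), k^{l-1}(x',x'))."""
    N0 = x.size
    k_xxp  = sigma_b2 + sigma_w2 / N0 * x @ xp
    k_xx   = sigma_b2 + sigma_w2 / N0 * x @ x
    k_xpxp = sigma_b2 + sigma_w2 / N0 * xp @ xp
    for _ in range(L):                       # l = 1, ..., L
        k_xxp, k_xx, k_xpxp = (
            sigma_b2 + sigma_w2 * F_relu(k_xxp, k_xx, k_xpxp),
            sigma_b2 + sigma_w2 * F_relu(k_xx, k_xx, k_xx),
            sigma_b2 + sigma_w2 * F_relu(k_xpxp, k_xpxp, k_xpxp),
        )
    return k_xxp

print(nngp_kernel(np.array([0.3, -0.7]), np.array([-0.5, 0.2]), L=3))
```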
10/18
NNGP: Gaussian process with NN-induced covariance function
Discovering the equivalence between infinite NNs and GPs
(R. M. Neal, Bayesian learning for neural networks, 1994)
Deducing analytical GP kernels for single-hidden-layer NNs with specific nonlinearities (the error function or a Gaussian nonlinearity)
(C. K. I. Williams, Computing with infinite networks, 1997)
Deriving the GP kernel for multi-hidden-layer NNs with general nonlinearity, based on signal propagation theory (Cho & Saul, 2009) (NNGP)
(J. Lee et al., Deep neural networks as Gaussian processes, 2018)
11/18
NNGP: Gaussian process with NN-induced covariance function
             | NN         | GP           | Bayesian NN | NNGP
Expressivity | High       | Intermediate | High        | High
Uncertainty  | No         | Yes          | Yes         | Yes
Cost         | Low        | Intermediate | High        | Intermediate
Accuracy     | It depends | It depends   | It depends  | It depends
12/18
Expressivity
Deep neural networks (DNNs) can compactly express highly complex functions over the input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. (Regression)
Deep neural networks can disentangle highly curved manifolds in the input space into flattened manifolds in the hidden space. (Classification)
A GP model is essentially a kernel learning machine whose performance relies on the choice of covariance kernel.
NNGP inherits the high expressivity of DNNs.
Poole B, Lahiri S, Raghu M, Sohl-Dickstein J, Ganguli S.
Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems 2016 (pp. 3360-3368).
13/18
Uncertainty quantification / surrogate accuracy
14/18
Accuracy (Regression problem)
15/18
Accuracy (Classification problem, J Lee et al, 2018)
16/18
Computational cost
NN: The loss function is generally non-convex, but stochastic gradient descent works well in practice (training cost is generally O(n), where n is the number of neurons).
GP: The loss function is non-convex, so gradient descent needs multiple starting points. Inverting the covariance matrix costs O(N^3), where N is the number of training data points.
Bayesian NN: Exact Bayesian inference is NP-hard, and approximate Bayesian inference can also be NP-hard. The cost of MCMC for approximating the posterior distribution of the network parameters can be O(m^2), where m is the number of parameters.
NNGP: Same as GP
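A minimal sketch of where the O(N^3) term comes from in GP (and hence NNGP) regression: a single Cholesky factorization of the N×N covariance matrix dominates the cost of computing the posterior mean. The RBF kernel, noise level, and synthetic data below are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500                                      # number of training points
X = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=N)

def rbf(A, B, ell=0.3):
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / ell**2)

K = rbf(X, X) + 1e-4 * np.eye(N)             # covariance matrix plus noise/jitter
L_chol = np.linalg.cholesky(K)               # the O(N^3) step
alpha = np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, y))

X_star = np.linspace(-1, 1, 5)[:, None]
mean_star = rbf(X_star, X) @ alpha           # posterior mean at test points
print(mean_star)
```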
17/18
Future work – to improve, apply, and generalize NNGP
Training an NNGP
NNGP for solving PDEs (forward and inverse problems)
Developing NNGP for other types of NNs, e.g., CNNs, RNNs, and GANs
18/18