When a neural net meets
a Gaussian process
Guofei Pang
05/11/18 Crunch seminar
Neural net (fully connected)
[Figure: fully connected network with layers labeled Input layer, First hidden layer, Last hidden layer, Output layer; nodes x_i^l, y_i^l are connected by weights w_{i,j}^{l+1,l} and biases b_i^{l+1}]
y_i^0 = x_i^0,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^{L+1} = \sum_{j=1}^{N_L} w_{i,j}^{L+1,L} y_j^L + b_i^{L+1}
y_i^{L+1} = \phi(x_i^{L+1}),  i = 1, 2, ..., N_{L+1}
2/18
Neural net (fully connected)
[Figure: the same fully connected network diagram as above, with layers labeled Input layer, First hidden layer, Last hidden layer, Output layer]
x_i^{l+1} = \sum_{j=1}^{N_l} w_{i,j}^{l+1,l} y_j^l + b_i^{l+1}
y_i^{l+1} = \phi(x_i^{l+1}),  i = 1, 2, ..., N_{l+1},  l = 0, 1, ..., L
y_i^0 = x_i^0
y_1^{L+1} = f_1(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
y_2^{L+1} = f_2(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
...
y_{N_{L+1}}^{L+1} = f_{N_{L+1}}(x_1^0, x_2^0, ..., x_{N_0}^0; \{W^{l+1,l}\}, \{b^{l+1}\})
Neural net (fully connected)
[Figure: the same network in vector notation: x^0 → y^0 → x^1 → y^1 → ... → x^L → y^L → x^{L+1} → y^{L+1}, with weights W^{1,0}, ..., W^{L+1,L} and biases b^1, ..., b^{L+1}]
x^{l+1} = W^{l+1,l} y^l + b^{l+1},  y^{l+1} = \phi(x^{l+1}),  l = 0, 1, ..., L
y^0 = x^0
y = y^{L+1} = f_NN(x = x^0; W = \{W^{l+1,l}\}, b = \{b^{l+1}\})
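To make the vectorized recursion concrete, here is a minimal NumPy sketch of y = f_NN(x; W, b). The layer widths, the tanh nonlinearity, and the random parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def f_nn(x, Ws, bs, phi=np.tanh):
    """Forward pass y = f_NN(x; W, b): x^{l+1} = W^{l+1,l} y^l + b^{l+1}, y^{l+1} = phi(x^{l+1})."""
    y = x                        # y^0 = x^0
    for W, b in zip(Ws, bs):     # l = 0, 1, ..., L
        y = phi(W @ y + b)
    return y                     # y^{L+1}

# Hypothetical example: N0 = 3 inputs, two hidden layers of width 50, one output
rng = np.random.default_rng(0)
widths = [3, 50, 50, 1]
Ws = [rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
      for n_in, n_out in zip(widths[:-1], widths[1:])]
bs = [rng.normal(size=n_out) for n_out in widths[1:]]
print(f_nn(np.array([0.1, -0.2, 0.3]), Ws, bs))
```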
4/18
Gaussian process
Multivariate normal distribution
p(Y) = |2\pi K|^{-1/2} \exp\{ -\tfrac{1}{2} (Y - m)^T K^{-1} (Y - m) \}
Y = [Y_1, Y_2, ..., Y_N]^T
m = E(Y) = [E(Y_1), E(Y_2), ..., E(Y_N)]^T
K = [K_{ij}] = [cov(Y_i, Y_j)]
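As a quick illustration of the density formula, here is a small NumPy sketch that evaluates p(Y); the particular m and K below are made-up values for the example.

```python
import numpy as np

def mvn_pdf(Y, m, K):
    """p(Y) = |2*pi*K|^{-1/2} exp{-(1/2) (Y - m)^T K^{-1} (Y - m)}."""
    d = Y - m
    quad = d @ np.linalg.solve(K, d)          # (Y - m)^T K^{-1} (Y - m)
    norm = np.sqrt(np.linalg.det(2 * np.pi * K))
    return np.exp(-0.5 * quad) / norm

m = np.array([0.0, 1.0])
K = np.array([[1.0, 0.3],
              [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, 0.5]), m, K))
```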
The index set of the random vector Y is discrete: {1, 2, ..., N}.
What if the index set is continuous, say the unit interval [0, 1]?
5/18
Discrete index i \in {1, 2, ..., 10}: zero mean m = 0, covariance matrix K with K_{ij} = cov(Y_i, Y_j) = \exp\{-10 (i - j)^2\}
Continuous index x \in [0, 1]: zero mean function m(x) = 0, covariance function k(x, x') = cov(Y_x, Y_{x'}) = \exp\{-10 (x - x')^2\}
Y ~ N(m, K)   vs.   Y_x := f_GP(x) ~ GP(m(x), k(x, x'))
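A sketch of the two constructions side by side with the covariance \exp\{-10 (x - x')^2\}: the discrete case draws the 10-dimensional Gaussian vector, the continuous case draws GP sample paths on a grid of [0, 1]. The grid size and the small jitter added for numerical stability are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(x, xp):
    return np.exp(-10.0 * (x - xp) ** 2)

# Discrete index i in {1, ..., 10}: Y ~ N(0, K) with K_ij = exp{-10 (i - j)^2}
i = np.arange(1, 11, dtype=float)
K_disc = k(i[:, None], i[None, :])
Y = rng.multivariate_normal(np.zeros(10), K_disc)

# Continuous index x in [0, 1]: evaluate the GP on a grid and draw sample paths
x = np.linspace(0.0, 1.0, 100)
K_cont = k(x[:, None], x[None, :]) + 1e-6 * np.eye(x.size)  # jitter for stability
f_gp = rng.multivariate_normal(np.zeros(x.size), K_cont, size=3)  # 3 sample paths

print(Y.shape, f_gp.shape)
```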
6/18
Neural net (fully connected) & Gaussian process
y = f_NN(x; W, b): a deterministic function f_NN
y = f_GP(x; ·, ·) ~ GP(m(x), k(x, x')): a stochastic function f_GP
7/18
Neural net (fully connected) & Gaussian process
y = f_GP(x; ·, ·) ~ GP(m(x), k(x, x'))
y = f_NN(x; W, b) → GP(0, k_{\sigma_w^2, \sigma_b^2}(x, x'))  if
W^{l+1,l} ~ D(0, (\sigma_w^2 / N_l) I)  (i.i.d.)
b^{l+1} ~ D(0, \sigma_b^2 I)  (i.i.d.)
N_l → ∞ for hidden layers
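A Monte Carlo sanity check of this limit (a sketch, not the original derivation): taking the distribution D to be Gaussian, draw many single-hidden-layer networks with a wide hidden layer and inspect the output at one fixed input; its empirical distribution should be approximately Gaussian with zero mean. The tanh nonlinearity, the widths, and the σ values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_w2, sigma_b2 = 1.0, 0.1
N0, N1, draws = 2, 2000, 5000          # wide hidden layer, many weight draws
x = np.array([0.3, -0.7])

outs = np.empty(draws)
for s in range(draws):
    W1 = rng.normal(0, np.sqrt(sigma_w2 / N0), size=(N1, N0))
    b1 = rng.normal(0, np.sqrt(sigma_b2), size=N1)
    W2 = rng.normal(0, np.sqrt(sigma_w2 / N1), size=(1, N1))
    b2 = rng.normal(0, np.sqrt(sigma_b2), size=1)
    y1 = np.tanh(W1 @ x + b1)
    outs[s] = (W2 @ y1 + b2)[0]        # pre-activation output x^2(x)

print("empirical mean (should be close to 0):", outs.mean())
print("empirical variance:", outs.var())
```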
8/18
[Figure: a single-hidden-layer network evaluated at input x: x → y^0 → x^1 → y^1 → x^2 → y^2, with weights and biases W^{1,0}, b^1 and W^{2,1}, b^2]
y_i^0 = x_i,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1} y_j^1 + b_i^2
y_i^2 = x_i^2,  i = 1, 2, ..., N_2
[Figure: the same network evaluated at input x']
y_i^0 = x'_i,  i = 1, 2, ..., N_0
x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0} y_j^0 + b_i^1
y_i^1 = \phi(x_i^1),  i = 1, 2, ..., N_1
x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1} y_j^1 + b_i^2
y_i^2 = x_i^2,  i = 1, 2, ..., N_2
k^0(x, x') = Cov(x_i^1(x), x_i^1(x')) = \sigma_b^2 + (\sigma_w^2 / N_0) x^T x'   (regardless of the subscript i)
k^1(x, x') = Cov(x_i^2(x), x_i^2(x')) = \sigma_b^2 + \sigma_w^2 F(k^0(x, x'), k^0(x, x), k^0(x', x'))
x_i^1(x) ~ GP(0, k^0(x, x')),   x_i^2(x) ~ GP(0, k^1(x, x'))
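A numerical check of the single-hidden-layer formulas (a sketch under assumed values): k^0 is evaluated in closed form, and the expectation F defining k^1 is estimated by Monte Carlo over (x_i^1(x), x_i^1(x')) ~ N(0, K^0), here with a tanh nonlinearity and made-up σ values.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_w2, sigma_b2 = 1.0, 0.1
x, xp = np.array([0.3, -0.7]), np.array([-0.5, 0.2])
N0 = x.size
phi = np.tanh

def k0(a, b):
    return sigma_b2 + sigma_w2 / N0 * a @ b

# 2x2 covariance of (x_i^1(x), x_i^1(x')) -- the same for every i
K0 = np.array([[k0(x, x),  k0(x, xp)],
               [k0(xp, x), k0(xp, xp)]])

# F(k^0(x,x'), k^0(x,x), k^0(x',x')) = E[phi(u) phi(v)], (u, v) ~ N(0, K0)
uv = rng.multivariate_normal(np.zeros(2), K0, size=200_000)
F = np.mean(phi(uv[:, 0]) * phi(uv[:, 1]))

k1 = sigma_b2 + sigma_w2 * F
print("k^0(x, x') =", k0(x, xp), "  k^1(x, x') ~", k1)
```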
9/18
Neural net (fully connected) & Gaussian process
y = f_NN(x; W, b) → GP(0, k_{\sigma_w^2, \sigma_b^2}(x, x'))  if
W^{l+1,l} ~ D(0, (\sigma_w^2 / N_l) I)  (i.i.d.)
b^{l+1} ~ D(0, \sigma_b^2 I)  (i.i.d.)
N_l → ∞ for hidden layers
k_{\sigma_w^2, \sigma_b^2}(x, x') = k^L(x, x') = Cov(x_i^{L+1}(x), x_i^{L+1}(x'))
k^l(x, x') = \sigma_b^2 + \sigma_w^2 F(k^{l-1}(x, x'), k^{l-1}(x, x), k^{l-1}(x', x')),  l = 1, 2, ..., L
k^0(x, x') = Cov(x_i^1(x), x_i^1(x')) = \sigma_b^2 + (\sigma_w^2 / N_0) x^T x'
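For a general nonlinearity F has no closed form, but for ReLU it does (the arc-cosine result of Cho & Saul, 2009), so the recursion above can be iterated exactly. The sketch below assumes ReLU and arbitrary σ_w², σ_b² values; it illustrates the recursion, not the exact setup of the slides.

```python
import numpy as np

def F_relu(k_xxp, k_xx, k_xpxp):
    """E[max(0,u) max(0,v)] for (u, v) ~ N(0, [[k_xx, k_xxp], [k_xxp, k_xpxp]]) (Cho & Saul, 2009)."""
    s = np.sqrt(k_xx * k_xpxp)
    theta = np.arccos(np.clip(k_xxp / s, -1.0, 1.0))
    return s / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def nngp_kernel(x, xp, L, sigma_w2=1.6, sigma_b2=0.1):
    """Iterate k^l = sigma_b^2 + sigma_w^2 * F(k^{l-1}(x,x'), k^{l-1}(x,x), k^{l-1}(x',x'))."""
    N0 = x.size
    k_xxp  = sigma_b2 + sigma_w2 / N0 * x @ xp
    k_xx   = sigma_b2 + sigma_w2 / N0 * x @ x
    k_xpxp = sigma_b2 + sigma_w2 / N0 * xp @ xp
    for _ in range(L):                       # l = 1, ..., L
        k_xxp, k_xx, k_xpxp = (
            sigma_b2 + sigma_w2 * F_relu(k_xxp, k_xx, k_xpxp),
            sigma_b2 + sigma_w2 * F_relu(k_xx, k_xx, k_xx),
            sigma_b2 + sigma_w2 * F_relu(k_xpxp, k_xpxp, k_xpxp),
        )
    return k_xxp

print(nngp_kernel(np.array([0.3, -0.7]), np.array([-0.5, 0.2]), L=3))
```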
10/18
NNGP: Gaussian process with NN-induced covariance function
Discovering the equivalence between infinite NNs and GPs
(R. M. Neal, Bayesian learning for neural networks, 1994)
Deducing analytical GP kernels for single-hidden-layer NNs with specific nonlinearities (the error function or a Gaussian nonlinearity)
(C. K. I. Williams, Computing with infinite networks, 1997)
Deriving the GP kernel for multi-hidden-layer NNs with general nonlinearity, based on signal propagation theory (Cho & Saul, 2009) (NNGP)
(J. Lee et al., Deep neural networks as Gaussian processes, 2018)
11/18
NNGP: Gaussian process with NN-induced covariance function
             | NN         | GP           | Bayesian NN | NNGP
Expressivity | High       | Intermediate | High        | High
Uncertainty  | No         | Yes          | Yes         | Yes
Cost         | Low        | Intermediate | High        | Intermediate
Accuracy     | It depends | It depends   | It depends  | It depends
12/18
Expressivity
Deep neural networks (DNNs) can compactly express highly complex functions over the input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. (Regression)
Deep neural networks can disentangle highly curved manifolds in the input space into flattened manifolds in the hidden space. (Classification)
A GP model is essentially a kernel learning machine whose performance relies on the choice of covariance kernel.
NNGP inherits the high expressivity of DNNs.
Poole B, Lahiri S, Raghu M, Sohl-Dickstein J, Ganguli S.
Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems 2016 (pp. 3360-3368).
13/18
Uncertainty quantification / surrogate accuracy
14/18
Accuracy (Regression problem)
15/18
Accuracy (Classification problem, J Lee et al, 2018)
16/18
Computational cost
NN: The loss function is generally non-convex, but stochastic gradient descent works well in practice (training cost is generally O(n), where n is the number of neurons).
GP: The loss function is non-convex, so gradient descent needs multiple starting points. Inverting the covariance matrix costs O(N^3), where N is the number of training data points.
Bayesian NN: Exact Bayesian inference is NP-hard, and approximate Bayesian inference can also be NP-hard. The cost of MCMC for approximating the posterior distribution of the network parameters can be O(m^2), where m is the number of parameters.
NNGP: Same as GP
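A minimal sketch of where the O(N^3) term comes from in GP (and hence NNGP) regression: a single Cholesky factorization of the N×N covariance matrix dominates the cost of computing the posterior mean. The RBF kernel, noise level, and synthetic data below are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500                                      # number of training points
X = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=N)

def rbf(A, B, ell=0.3):
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / ell**2)

K = rbf(X, X) + 1e-4 * np.eye(N)             # covariance matrix plus noise/jitter
L_chol = np.linalg.cholesky(K)               # the O(N^3) step
alpha = np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, y))

X_star = np.linspace(-1, 1, 5)[:, None]
mean_star = rbf(X_star, X) @ alpha           # posterior mean at test points
print(mean_star)
```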
17/18
Future work – to improve, apply, and generalize NNGP
Training an NNGP
NNGP for solving PDEs (forward and inverse problems)
Developing NNGP for other types of NNs, e.g., CNNs, RNNs, and GANs
18/18