A picture of the energy landscape of deep neural networks
Pratik Chaudhari
December 15, 2017
UCLA VISION LAB
What is the shape?
How should it inform the optimization?
$$y(x; w) = \sigma\big(w_p\, \sigma(w_{p-1}(\ldots \sigma(w_1 x))\ldots)\big)$$

$$w^* = \operatorname*{argmin}_w \; \mathbb{E}_{(x,y) \in D} \left[\, \sum_{i=1}^{k} -\mathbb{1}_{\{y=i\}}\, \log y_i(x; w) \right] =: f(w)$$
(S)GD: many, many variants (AdaGrad, SVRG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…)

$$w_{k+1} = w_k - \eta\, \nabla f_b(w_k)$$

SGD: Lee et al., '16 (mere gradient descent always finds local minima); Ge et al., '15 and Sun et al., '15, '16 (can escape strict saddle points); Saxe et al., '14 (orthogonal initializations)
variance reduction: Schmidt et al., '13; Defazio et al., '14
AdaGrad: Duchi et al., '10
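As a concrete illustration of the update above, here is a minimal sketch of mini-batch SGD on the cross-entropy loss of a softmax classifier; the synthetic data, model size, and hyper-parameters are illustrative assumptions, not details from the slides.

```python
import numpy as np

# Minimal sketch of the update w_{k+1} = w_k - eta * grad f_b(w_k) for the
# cross-entropy loss of a softmax classifier on synthetic data.

rng = np.random.default_rng(0)
n, d, k = 1000, 10, 3                                # samples, features, classes
X = rng.standard_normal((n, d))
y = rng.integers(0, k, size=n)

def grad_f_b(W, idx):
    """Mini-batch gradient of the average cross-entropy loss."""
    logits = X[idx] @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # softmax probabilities
    p[np.arange(len(idx)), y[idx]] -= 1.0            # dL/dlogits = p - onehot(y)
    return X[idx].T @ p / len(idx)

W, eta, b = np.zeros((d, k)), 0.1, 32
for step in range(2000):
    idx = rng.integers(0, n, size=b)                 # sample a mini-batch
    W -= eta * grad_f_b(W, idx)                      # SGD update
```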
What do we know about the energy landscape?
multiple, equivalent minima:
linear networks (PCA): Baldi & Hornik, '89
deep GPs: Duvenaud et al., '14
piecewise-linear networks: Soudry & Carmon, '16
matrix/tensor factorization: Haeffele & Vidal, '15
generalization (our work; Hardt et al., '15; Goodfellow & Vinyals, '15): can easily get zero training error, yet generalize poorly
paradox between a “benign” energy landscape and delicate training algorithms
empirical results: all local minima are close to the global minimum
Dauphin et al., '14: saddle points slow down SGD
statistical physics (Gaussian random fields, spin glasses): many descending directions at high energy; Bray & Dean, '07; Fyodorov & Williams, '07; Choromanska et al., '15; Chaudhari & Soatto, '15
binary perceptron: Baldassi et al., '15
Motivation from the Hessian
[Figure: histograms of the eigenvalues of the Hessian (frequency on a log scale vs. eigenvalue); the left panel spans roughly -5 to 40, the right panel zooms into [-0.5, 0].]
Short negative tail
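To make this picture concrete, here is a minimal sketch that computes the Hessian spectrum of a tiny two-layer network by finite differences and counts negative, near-zero, and positive eigenvalues; the network, data, step sizes, and thresholds are illustrative assumptions and are not meant to reproduce the histograms above.

```python
import numpy as np

# Minimal sketch: eigenvalue spectrum of the Hessian of a tiny two-layer
# network's loss at a point reached by a few gradient-descent steps.

rng = np.random.default_rng(0)
n, d, h = 200, 5, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def unpack(w):
    return w[:d * h].reshape(d, h), w[d * h:]

def loss(w):
    W1, w2 = unpack(w)
    return 0.5 * np.mean((np.tanh(X @ W1) @ w2 - y) ** 2)

def num_grad(w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = 0.5 * rng.standard_normal(d * h + h)
for _ in range(300):                                 # crude gradient descent
    w -= 0.2 * num_grad(w)

H = np.zeros((len(w), len(w)))                       # Hessian via central differences of the gradient
for i in range(len(w)):
    e = np.zeros_like(w); e[i] = 1e-4
    H[:, i] = (num_grad(w + e) - num_grad(w - e)) / 2e-4
eig = np.linalg.eigvalsh((H + H.T) / 2)
tol = 1e-3 * np.abs(eig).max()
print("negative:", (eig < -tol).sum(),
      "near zero:", (np.abs(eig) <= tol).sum(),
      "positive:", (eig > tol).sum())
```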
How do we exploit this?
Magnify the energy landscape and smooth with a kernel
$$w^* = \operatorname*{argmin}_w f(w) = \operatorname*{argmax}_w\, e^{-f(w)} \approx \operatorname*{argmin}_w \; -\log\big(G_\gamma * e^{-f}\big)(w)$$

where $G_\gamma$ is a Gaussian kernel of variance $\gamma$, and the convolution focuses on the neighborhood of $w$.
[Figure: a one-dimensional loss f(x) together with its smoothed versions F(x, 10^3) and F(x, 2 × 10^4), marking a candidate point x, the smoothed minimizer x̂_F, and the original minimizer x̂_f.]
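A minimal numerical sketch of this smoothing on a one-dimensional toy loss: convolve exp(-f) with a Gaussian kernel and take the negative log. The toy loss and the kernel widths are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Minimal sketch of  F(x, gamma) = -log(G_gamma * exp(-f))  on a 1-D grid.

x = np.linspace(-3.0, 3.0, 2001)
dx = x[1] - x[0]
f = 0.25 * x**4 - x**2 + 0.3 * np.sin(8.0 * x)       # rugged toy loss

def smoothed(f_vals, gamma):
    # sigma of the Gaussian kernel is expressed in grid units
    return -np.log(gaussian_filter1d(np.exp(-f_vals), sigma=np.sqrt(gamma) / dx))

for gamma in (0.05, 0.5):
    F = smoothed(f, gamma)
    print(f"gamma={gamma}: argmin f = {x[np.argmin(f)]:+.2f}, "
          f"argmin F = {x[np.argmin(F)]:+.2f}")
```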
A physics interpretation
[Figure: smoothing can create a new global minimum, different from the original global minimum.]
Our new loss is "local entropy" (Baldassi et al., '15, '16):

$$f_\gamma(w) = -\log \int_{w'} \exp\left(-f(w') - \frac{1}{2\gamma}\, \|w - w'\|^2\right) dw'$$
Minimizing local entropy
‣ Solve
$$w^* = \operatorname*{argmin}_w f_\gamma(w)$$
‣ Gradient of local entropy:
$$\nabla f_\gamma(w) = \gamma^{-1}\left(w - \langle w' \rangle\right),$$
where $\langle \cdot \rangle$ denotes an expectation over a local Gibbs distribution,
$$\langle w' \rangle = Z(w, \gamma)^{-1} \int_{w'} w'\, \exp\left(-f(w') - \frac{1}{2\gamma}\, \|w - w'\|^2\right) dw'$$
‣ Estimate the gradient using MCMC; this can be applied to general deep networks.
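A minimal sketch of this recipe on a toy one-dimensional loss: estimate ⟨w'⟩ with a few steps of stochastic gradient Langevin dynamics and then move w along γ^{-1}(w - ⟨w'⟩). The toy loss, step sizes, number of Langevin steps, and the running average are illustrative assumptions, not the exact algorithm from the talk.

```python
import numpy as np

# Minimal Entropy-SGD-style outer step: SGLD on the local Gibbs distribution
# to estimate <w'>, followed by a step along the local-entropy gradient.

rng = np.random.default_rng(0)

def grad_f(w):
    return w**3 - 2.0 * w + 0.1                      # gradient of a toy non-convex loss

def entropy_sgd_step(w, gamma=0.1, eta=0.05, eta_in=0.01, L=20):
    w_prime, w_bar = w, w
    for _ in range(L):
        # Langevin step on exp(-f(w') - |w' - w|^2 / (2 gamma))
        g = grad_f(w_prime) + (w_prime - w) / gamma
        w_prime += -eta_in * g + np.sqrt(eta_in) * rng.standard_normal()
        w_bar = 0.75 * w_bar + 0.25 * w_prime        # running estimate of <w'>
    return w - eta * (w - w_bar) / gamma             # step along grad f_gamma(w)

w = 2.0
for _ in range(200):
    w = entropy_sgd_step(w)
print(w)
```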
Medium-scale CNN
‣ All-CNN-BN on CIFAR-10
‣ Do not see much plateauing of training or validation loss
[Figure: validation error (%) and training cross-entropy loss vs. epochs × L (the x-axis is proportional to wall-clock time) for SGD and Entropy-SGD; the two methods reach final errors of 7.71% and 7.81% and final losses of 0.0353 and 0.0336.]
A PDE interpretation
‣ Local entropy is the solution of a Hamilton-Jacobi equation with the original loss as the initial condition:
$$u_t = -\frac{1}{2}\, \|\nabla u\|^2 + \frac{1}{2}\, \Delta u, \qquad u(w, 0) = f(w),$$
and $f_\gamma(w) = u(w, \gamma)$.
‣ Stochastic control interpretation (quadratic penalty for greedy gradient descent):
$$dw = -\alpha(s)\, ds + dB(s), \quad t \le s \le T, \qquad w(t) = w,$$
$$C(w(\cdot), \alpha(\cdot)) = \mathbb{E}\left[ f(w(T)) + \frac{1}{2} \int_t^T \|\alpha(s)\|^2\, ds \right],$$
$$u(w, t) = \min_{\alpha(\cdot)} C(w(\cdot), \alpha(\cdot)), \qquad \alpha^*(w, t) = \nabla u(w, t).$$
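For completeness, a short derivation (a standard Cole-Hopf computation, not spelled out on the slide) links this PDE back to the Gaussian smoothing of the earlier slide. Setting $v = e^{-u}$ gives $u_t = -v_t/v$, $\nabla u = -\nabla v/v$, and $\Delta u = -\Delta v/v + \|\nabla v\|^2/v^2$, so

$$u_t = -\frac{1}{2}\,\|\nabla u\|^2 + \frac{1}{2}\,\Delta u \quad \Longleftrightarrow \quad v_t = \frac{1}{2}\,\Delta v, \qquad v(w, 0) = e^{-f(w)}.$$

The heat equation is solved by Gaussian convolution, $v(w, t) = \big(G_t * e^{-f}\big)(w)$, hence

$$u(w, t) = -\log\big(G_t * e^{-f}\big)(w), \qquad f_\gamma(w) = u(w, \gamma),$$

which is exactly the smoothed loss $-\log\big(G_\gamma * e^{-f}\big)$ from before.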
New PDEs
‣ Use the non-viscous HJ equation (drop the viscous term $\frac{1}{2}\Delta u$):
$$u_t = -\frac{1}{2}\, \|\nabla u\|^2$$
‣ The Hopf-Lax formula gives the solution, the "inf-convolution" or Moreau envelope:
$$u(w, t) = \inf_{w'} \left\{ f(w') + \frac{1}{2t}\, \|w - w'\|^2 \right\}$$
‣ This has a few magical properties…
‣ Simple formula for the gradient (proximal point iteration):
$$\nabla u(w, t) = p^*, \qquad p^* = \nabla f(w - t\, p^*)$$
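A minimal sketch of this formula on a toy one-dimensional loss: compute the inf-convolution by brute force on a grid and check the proximal-point identity for the gradient. The toy loss, grid, and value of t are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the Hopf-Lax / Moreau-envelope formula
#   u(w, t) = inf_{w'} [ f(w') + |w - w'|^2 / (2 t) ]
# and of the gradient identity  grad u(w, t) = p*,  p* = grad f(w - t p*).

def f(w):
    return 0.25 * w**4 - w**2 + 0.3 * np.sin(8.0 * w)   # rugged toy loss

def grad_f(w):
    return w**3 - 2.0 * w + 2.4 * np.cos(8.0 * w)

grid = np.linspace(-3.0, 3.0, 4001)

def moreau(w, t):
    inner = f(grid) + (w - grid) ** 2 / (2.0 * t)
    j = np.argmin(inner)
    return inner[j], grid[j]                             # value u(w, t) and inner minimizer w'*

w, t = 2.0, 0.3
u, w_star = moreau(w, t)
p_star = (w - w_star) / t                                # gradient of the envelope at w
print("p*               =", p_star)
print("grad f(w - t p*) =", grad_f(w - t * p_star))      # should approximately agree
```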
Smoothing using PDEs
[Figure: a one-dimensional loss f(x) with its smoothed versions u_viscous-HJ(x, T) and u_non-viscous-HJ(x, T); the initial density, where SGD gets stuck, is compared with the final densities under the non-viscous and viscous HJ dynamics.]
Distributed training algorithms
‣ A continuous-time view of local entropy:
slow variable: $dw = -\gamma^{-1}\,(w - w')\, ds$
fast variable: $dw' = -\dfrac{1}{\epsilon}\left[\nabla f(w') + \dfrac{1}{\gamma}\,(w' - w)\right] ds + \dfrac{1}{\sqrt{\epsilon}}\, dB(s)$
‣ Elastic-SGD (Zhang et al., '15):
$$\operatorname*{argmin}_{w,\, w'_1, \ldots, w'_p} \; \sum_{k=1}^{p} \left[ f(w'_k) + \frac{1}{2\gamma}\, \|w'_k - w\|^2 \right]$$
‣ "Homogenized" dynamics as $\epsilon \to 0$: if $w'(s)$ is very fast, $w(s)$ only sees its average,
$$dw = -\gamma^{-1}\left(w - \langle w' \rangle\right) ds, \qquad \rho^\infty(w'; w) \propto \exp\left(-f(w') - \frac{1}{2\gamma}\, \|w' - w\|^2\right),$$
which is exactly the gradient of local entropy, $\nabla f_\gamma(w) = \gamma^{-1}\,(w - \langle w' \rangle)$.
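A minimal sketch of an Elastic-SGD-style update on a toy one-dimensional loss: p replicas take noisy gradient steps on f plus an elastic term pulling them towards a master variable, which in turn moves towards their average. The replica count, noise level, step sizes, and toy loss are illustrative assumptions.

```python
import numpy as np

# Minimal Elastic-SGD-style loop: replicas w'_1..w'_p coupled to a master w.

rng = np.random.default_rng(0)

def grad_f(w):
    return w**3 - 2.0 * w + 2.4 * np.cos(8.0 * w)        # gradient of a rugged toy loss

p, gamma, eta = 8, 0.3, 0.01
w = 2.0                                                   # master variable
w_rep = np.full(p, w)                                     # replicas

for _ in range(5000):
    noise = 0.1 * rng.standard_normal(p)                  # stands in for mini-batch noise
    w_rep -= eta * (grad_f(w_rep) + noise + (w_rep - w) / gamma)
    w -= eta * (w - w_rep.mean()) / gamma                 # master only sees the replicas' average
print(w)
```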
Wide-ResNet on CIFAR-10 / 100
WTH is implicit regularization?
Why is SGD so special?
‣ Fokker-Planck equation and optimal transportation:
$$\rho^* = \operatorname*{argmin}_\rho \; \int \Phi(w)\, \rho(w)\, dw + \beta^{-1} \int \rho(w) \log \rho(w)\, dw$$
Information bottleneck, Bayesian inference, large batch-sizes, sampling techniques, hyper-parameter choices, neural architecture search, …
Many, many variants: AdaGrad, SVRG, SAG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…
‣ Stochastic differential equation:
$$dw = -\nabla f(w)\, dt + \sqrt{2\beta^{-1}\, D(w)}\; dB(t), \qquad \beta^{-1} = \frac{\eta}{2b}$$
($\eta$: learning rate, $b$: mini-batch size, $D(w)$: the gradient-noise diffusion matrix).
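A minimal sketch that simulates this SDE with the Euler-Maruyama scheme for an isotropic quadratic loss (D(w) = 1) and checks that the stationary variance matches β^{-1} = η/(2b). The learning rate, batch size, and time step are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: dw = -grad f(w) dt + sqrt(2 beta^{-1}) dB(t) for f(w) = w^2 / 2.
# The stationary density is proportional to exp(-beta f), so Var(w) = 1/beta.

eta, b = 0.1, 32
beta_inv = eta / (2.0 * b)                 # temperature set by learning rate and batch size
dt, steps = 1e-3, 200_000
rng = np.random.default_rng(0)

w, samples = 0.0, []
for k in range(steps):
    w += -w * dt + np.sqrt(2.0 * beta_inv * dt) * rng.standard_normal()
    if k > steps // 2:                     # discard burn-in
        samples.append(w)

print("empirical variance:", np.var(samples))
print("predicted 1/beta  :", beta_inv)
```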
SGD does not minimize f(w); what is Φ(w)?
‣ Deep networks induce highly non-isotropic noise
CIFAR-10: $\lambda(D) = 0.27 \pm 0.84$, $\operatorname{rank}(D) = 0.34\%$
‣ Leads to deterministic, Hamiltonian dynamics in SGD:
$$\dot{x} = j(x), \qquad \operatorname{div} j(x) = 0$$
Most likely trajectories of SGD are closed loops (figure: a loop around a saddle-point).
‣ Deep networks have an out-of-equilibrium distribution:
$$\rho^*(w) \not\propto e^{-\beta f(w)}, \qquad \rho^*(w) \propto e^{-\beta \Phi(w)}$$
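A minimal sketch of measuring the gradient-noise covariance D for a toy linear least-squares model and inspecting how anisotropic and low-rank it is; the synthetic data and thresholds are illustrative assumptions and will not reproduce the CIFAR-10 numbers above.

```python
import numpy as np

# Minimal sketch: D as the covariance of per-sample gradients for a toy model.

rng = np.random.default_rng(0)
n, d = 512, 20
X = rng.standard_normal((n, d)) @ np.diag(np.linspace(0.1, 3.0, d))  # anisotropic features
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w = np.zeros(d)

per_sample_grads = X * (X @ w - y)[:, None]           # gradient of 1/2 (x.w - y)^2 per sample
D = np.cov(per_sample_grads, rowvar=False)            # covariance of the gradient noise

eig = np.linalg.eigvalsh(D)
print("eigenvalues: mean", eig.mean(), "std", eig.std())
print("effective rank:", (eig > 1e-3 * eig.max()).sum(), "of", d)
```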
Summary
‣ Techniques from control and physics are interpretable, and also lead to state-of-the-art algorithms
‣ Control has powerful tools to make inroads into understanding and improving deep networks
PDEs, stochastic control, stability of limit cycles, Fokker-Planck equations, continuous-time analysis…
input-output stability, reinforcement learning & optimal control
‣ Deep learning is powerful and quite "easy" to get into; even the fundamentals are unknown and debated.
Thank You!
Joint work with
www.pratikac.info
Stefano Soatto
Guillaume Carlier, Adam Oberman, Stanley Osher
Yann LeCun, Anna Choromanska, Ameet Talwalkar
Carlo Baldassi, Christian Borgs, Jennifer Chayes, Riccardo Zecchina