Mini-Course 3: Convergence Analysis of Neural Network
Yang Yuan, Computer Science Department, Cornell University


Page 1: Mini-Course 3: Convergence Analysis of Neural Network

Yang Yuan
Computer Science Department, Cornell University

Page 2: Deep learning is powerful

Page 3: What is a neural network?

A simplified view (one layer). Missing: Convolution/BatchNorm.

← Input x = (1, 2, −4)^T
← Weight W =
  [ 1 0 0 1
    0 1 0 1
    0 0 1 1 ]
← W^T x = (1, 2, −4, −1)^T
← ReLU(W^T x) = (1, 2, 0, 0)^T
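To make the forward pass above concrete, here is a minimal NumPy sketch (not from the original slides) that reproduces the numbers:

```python
import numpy as np

# One-layer ReLU network from the slide: x in R^3, W in R^{3x4}.
x = np.array([1.0, 2.0, -4.0])
W = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

pre = W.T @ x                  # W^T x = [ 1.  2. -4. -1.]
post = np.maximum(pre, 0.0)    # ReLU(W^T x) = [1. 2. 0. 0.]
print(pre, post)
```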


Page 5: Three basic types of theory questions

- Representation
  - Can we express arbitrary functions with neural networks?
  - Yes, very expressive: [Hornik et al., 1989, Cybenko, 1992, Barron, 1993, Eldan and Shamir, 2015, Safran and Shamir, 2016, Lee et al., 2017]
- Optimization
  - Are there efficient methods for finding good parameters (i.e., representations)?
- Generalization
  - Training data is used for the optimization step. Does the result generalize to unseen (test) data?
  - Little is known. Flat minima? [Shirish Keskar et al., 2016, Hochreiter and Schmidhuber, 1995, Chaudhari et al., 2016, Zhang et al., 2016]
- In practice, neural networks do great on ALL THREE! What about in theory?


Page 9: Optimization

- In practice, SGD always finds good local minima.
  - SGD: stochastic gradient descent
  - x_{t+1} = x_t − η g_t, with E[g_t] = ∇f(x_t) (a minimal sketch follows below)
- Some results are negative, saying optimization for neural networks is in general hard: [Šíma, 2002, Livni et al., 2014, Shamir, 2016].
- Others are positive but use special algorithms (tensor decomposition, half-space intersection, etc.): [Janzamin et al., 2015, Zhang et al., 2015, Sedghi and Anandkumar, 2015, Goel et al., 2016].
- Or they need strong assumptions on the model (complex-valued weights, learning polynomials only, i.i.d. random weights): [Andoni et al., 2014, Arora et al., 2014].
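As a reference for the update rule above, a minimal SGD sketch (my own, with a toy quadratic objective and hypothetical step size and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad, x0, eta=0.05, steps=1000, noise=0.1):
    """Run x_{t+1} = x_t - eta * g_t with an unbiased stochastic gradient, E[g_t] = grad(x_t)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x) + noise * rng.standard_normal(x.shape)
        x = x - eta * g
    return x

# Toy example: f(x) = 0.5 * ||x||^2 has gradient x; SGD drifts toward the minimizer at 0.
print(sgd(lambda x: x, x0=[3.0, -2.0]))
```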


Page 13: Recent work: independent activations

Independent activation assumption:

- The outputs of the ReLU units are independent of the input x, and independent of each other. [Choromanska et al., 2015, Kawaguchi, 2016, Brutzkus and Globerson, 2017]

← Input x = (1, 2, −4)^T
← Weight W =
  [ 1 0 0 1
    0 1 0 1
    0 0 1 1 ]
← W^T x = (1, 2, −4, −1)^T
← ReLU(W^T x) = (1, 2, 0, 0)^T   (two of these outputs are marked "Independent!" in the slide's figure)


Page 16: CNN model in [Brutzkus and Globerson, 2017]

Page 17: Recent work: guarantees for other algorithms

- Tensor decomposition + gradient descent converges to the ground truth for a one-hidden-layer network [Zhong et al., 2017].
- Kernel methods can learn deep neural networks under an eigenvalue-decay assumption [Goel and Klivans, 2017].

Page 18: Recent work: deep linear models

Ignore the activation functions.

- [Saxe et al., 2013, Kawaguchi, 2016, Hardt and Ma, 2016]
- Such networks only learn a linear function.
- Proof idea (for a deep linear residual network):
  - Loss: f(W) = ‖(I + W_ℓ) ··· (I + W_1)x − (Ax + b)‖²
  - Compute ∂f/∂W_i.
  - Get a lower bound for the gradient norm: ‖∇f(W)‖²_F ≥ C (f(W) − f∗).
  - Show C > 0, so at any stationary point f(W) = f∗.


Page 26: Today's paper: convergence analysis for two-layer networks with ReLU [Li and Yuan, 2017]

Below: our thought process; it might be messy.

Page 27: Our starting model

- A two-layer model, not deep:
  f(x, W) = ‖ReLU(W^T x)‖_1
- Assume there exists a teacher network producing the labels:
  f(x, W∗) = ‖ReLU(W∗^T x)‖_1
- Square loss:
  L(W) = E_x[(f(x, W) − f(x, W∗))^2]
- x ∼ N(0, I): a common assumption [Choromanska et al., 2015, Tian, 2016, Xie et al., 2017]. (A small numerical sketch follows this slide.)

[Diagram: input x → W^T x → ReLU(W^T x) → take sum → output.]
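For concreteness, a small NumPy sketch of this teacher/student setup (my own; the dimension, weight scale, and sample size are hypothetical): student and teacher share the architecture f(x, W) = ‖ReLU(W^T x)‖_1, inputs are standard Gaussian, and L(W) is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def f(x, W):
    """Two-layer network from the slide: f(x, W) = ||ReLU(W^T x)||_1."""
    return np.maximum(W.T @ x, 0.0).sum()

W_star = 0.1 * rng.standard_normal((d, d))   # teacher weights (hypothetical scale)
W      = 0.1 * rng.standard_normal((d, d))   # student weights

# Monte-Carlo estimate of L(W) = E_x[(f(x, W) - f(x, W*))^2] with x ~ N(0, I).
xs = rng.standard_normal((5000, d))
loss = np.mean([(f(x, W) - f(x, W_star)) ** 2 for x in xs])
print(loss)
```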



Page 33: However.. [stuck]

[Tian, 2016] showed that even if

- W is initialized symmetrically, and
- W∗ forms an orthonormal basis,

gradient descent may get stuck at saddle points.

Complicated surface, hard to analyze.

[Figure from [Tian, 2016].]

Page 34: Residual network [He et al., 2016]

- State-of-the-art architecture (3,300 citations).
- Won a few competitions.
- Very powerful after stacking multiple blocks together.
- Easy to train.

[Figure: a ResNet block, from [He et al., 2016].]


Page 37: Adding a residual link?

- We modify the network
  f(x, W) = ‖ReLU(W^T x)‖_1
  to (adding the identity)
  f(x, W) = ‖ReLU((I + W)^T x)‖_1
- Same for the ground truth f(x, W∗).
- Essentially: shift the weights by I.

[Diagram: input x → W^T x → ⊕ (residual link, +x) → ReLU((I + W)^T x) → take sum → output.]



Page 42: Ask Simulation: does SGD converge for this model?

Simulation says: yes.
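The simulation itself is not included in the transcript; below is a rough stand-in I wrote for the residual model (hypothetical dimension, weight scale, step size, and batch size), running mini-batch SGD on the square loss and tracking the distance to the teacher:

```python
import numpy as np

rng = np.random.default_rng(1)
d, eta, batch = 10, 0.01, 64
I = np.eye(d)

def f(X, W):
    # Residual model from the slides: f(x, W) = ||ReLU((I + W)^T x)||_1, batched over rows of X.
    return np.maximum(X @ (I + W), 0.0).sum(axis=1)

W_star = 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)   # hypothetical small ||W*||_2
W = np.zeros((d, d))                                      # start at W = 0, i.e. I + W = I

for t in range(10001):
    X = rng.standard_normal((batch, d))                   # x ~ N(0, I)
    err = f(X, W) - f(X, W_star)                          # residual of the square loss
    gate = (X @ (I + W) > 0).astype(float)                # ReLU derivative (taken as 0 at the kink)
    grad = X.T @ (err[:, None] * gate) / batch            # gradient of mean 0.5 * err^2 w.r.t. W
    W -= eta * grad
    if t % 2500 == 0:
        print(t, np.linalg.norm(W - W_star))              # distance to the teacher (should shrink)
```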

Page 43: Illustration of our key observation

[Figure: points O (origin), I (identity), I + W, and I + W∗, with regions labeled "Easy for SGD" (around I), "Unknown", and "Seems hard" (around O), and an arrow labeled "Residual link".]


Page 46: How to prove this?

One-point convexity.

Definition: a function f(x) is called δ-one point strongly convex in a domain D with respect to a point x∗ if, for all x ∈ D,
⟨−∇f(x), x∗ − x⟩ > δ ‖x∗ − x‖²_2.

- A weaker condition than convexity.
- If the loss is one-point convex, we get closer to W∗ after every step, as long as the step size is small (a short derivation follows this slide).

[Figure: 3D surface plot (axes 0–200 × 0–200, height −5 to 15).]
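Why one-point convexity gives progress (a standard one-line argument, stated for gradient descent with an exact gradient; not verbatim from the slides):

```latex
% Gradient descent step: x_{t+1} = x_t - \eta \nabla f(x_t).
\|x_{t+1}-x^*\|_2^2
  = \|x_t-x^*\|_2^2 - 2\eta\,\langle -\nabla f(x_t),\, x^*-x_t\rangle + \eta^2\,\|\nabla f(x_t)\|_2^2
  < (1-2\eta\delta)\,\|x_t-x^*\|_2^2 + \eta^2\,\|\nabla f(x_t)\|_2^2 .
```

So whenever η‖∇f(x_t)‖²_2 < 2δ‖x_t − x∗‖²_2 (in particular for a small enough step size away from x∗), the distance to x∗ strictly decreases; an analogous calculation in expectation underlies the SGD version.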


Page 48: One-point convexity: an illustration

[Figure: a trajectory of iterates from W_1 to W_5 approaching W∗.]


Page 50: Ask Simulation: is it one-point convex?

Simulation says: yes.


Page 52: Compute ⟨−∇L(W), W∗ − W⟩

−∇L(W)_j = Σ_{i=1}^{d} [ (π/2)(w∗_i − w_i)
                         + (π/2 − θ_{i∗,j})(e_i + w∗_i) − (π/2 − θ_{i,j})(e_i + w_i)
                         + (‖e_i + w∗_i‖_2 sin θ_{i∗,j} − ‖e_i + w_i‖_2 sin θ_{i,j})(e_j + w_j) ]

- e_i, w_i, w∗_i are the column vectors of I, W, W∗.
- θ_{i,j∗}: the angle between e_i + w_i and e_j + w∗_j (hard to handle).
- sin θ_{i,j} (hard to handle).

Page 53: Taylor expansion: tedious calculation

What is θ_{i,j∗}? It is the angle between w_i + e_i and w∗_j + e_j. We have

cos(θ_{i,j∗}) = ⟨w_i + e_i, w∗_j + e_j⟩ / (‖w_i + e_i‖ ‖w∗_j + e_j‖)
             = (⟨w_i, w∗_j⟩ + w_{i,j} + w∗_{j,i}) / (‖w_i + e_i‖ ‖w∗_j + e_j‖)
             ≈ (1 − w_{i,i})(1 − w∗_{j,j}) (⟨w_i, w∗_j⟩ + w_{i,j} + w∗_{j,i})
             ≈ (1 − w_{i,i} − w∗_{j,j}) (⟨w_i, w∗_j⟩ + w_{i,j} + w∗_{j,i})
             ≈ ⟨w_i, w∗_j⟩ + w_{i,j} + w∗_{j,i} − w_{i,i} w_{i,j} − w_{i,i} w∗_{j,i} − w∗_{j,j} w_{i,j} − w∗_{j,j} w∗_{j,i}

Since arccos(x) ≈ π/2 − x, we have

θ_{i,j∗} ≈ π/2 − (⟨w_i, w∗_j⟩ + w_{i,j} + w∗_{j,i}) / (‖w_i + e_i‖ ‖w∗_j + e_j‖)


Page 56: However, direct Taylor expansion is too loose [stuck]

- In order to show ⟨−∇L(W), W∗ − W⟩ ≥ 0, we need to assume γ ≜ max{‖W‖_2, ‖W∗‖_2} ≤ O(1/d).
- That is a super-local region.


Page 58: Ask Simulation: what is the largest γ that satisfies one-point convexity?

Simulation says: Ω(1).

[Figure: same illustration as Page 43 — O, I, I + W, I + W∗, regions "Easy for SGD", "Unknown", "Seems hard", and the "Residual link" arrow.]


Page 60: Use geometry to get tighter bounds!

- Denote e_i + w∗_i by OC and e_i + w_i by OD, and their unit (normalized) vectors by OA and OB respectively. Thus ‖w∗_i − w_i‖_2 = ‖DC‖_2.
- Draw HB ∥ CD, so ‖OH‖_2 ≥ ‖OB‖_2 = ‖OA‖_2.
- Since △CDO ∼ △HBO, we have
  ‖CD‖_2 / ‖HB‖_2 = ‖OD‖_2 / ‖OB‖_2 = ‖OD‖_2 ≥ 1 − γ.
- So ‖CD‖_2 ≥ (1 − γ) ‖HB‖_2.

[Figure: points O, A, B, C, D, E, F, G, H, with OC = e_i + w∗_i, OD = e_i + w_i, and OA, OB the corresponding unit vectors.]


Page 64: Central Lemma: geometric lemma

- △ABO is an isosceles triangle, so ‖AG‖_2 = ‖GB‖_2.
- ‖HB‖_2 ≥ ‖AB‖_2 = 2‖GB‖_2.
- ‖GB‖_2 ≤ ‖HB‖_2 / 2 ≤ ‖CD‖_2 / (2(1 − γ)).
- Hence ‖GB‖_2 ≤ ‖CD‖_2 / (2(1 − γ)).

(Uses ‖CD‖_2 ≥ (1 − γ)‖HB‖_2 from the previous slide.)

[Figure: same construction as Page 60 — points O, A, B, C, D, E, F, G, H.]


Page 68: Central Lemma: geometric lemma

- △ABE ∼ △BGO.
- ‖AE‖_2 / ‖AB‖_2 = ‖OG‖_2 / ‖OB‖_2 = √(1 − ‖GB‖_2²) / 1.
- ‖AE‖_2 / ‖AB‖_2 ≥ √(1 − ‖CD‖_2² / (4(1 − γ)²)) ≥ √(1 − (γ / (1 − γ))²).

(Uses ‖GB‖_2 ≤ ‖CD‖_2 / (2(1 − γ)) from the previous step and ‖w_i − w∗_i‖_2 ≤ 2γ.)

[Figure: same construction — points O, A, B, C, D, E, F, G, H.]


Page 71: Central Lemma: geometric lemma

- ‖CD‖_2 ≥ ‖CF‖_2.
- △CFO ∼ △AEO.
- ‖W∗‖_2 ≤ γ.
- ‖CD‖_2 / ‖AE‖_2 ≥ ‖CF‖_2 / ‖AE‖_2 = ‖OC‖_2 / ‖OA‖_2 = ‖e_i + w∗_i‖_2 ≥ 1 − γ.
- ‖CD‖_2 ≥ (1 − γ)‖AE‖_2 ≥ (1 − γ)√(1 − (γ / (1 − γ))²) ‖AB‖_2 = √(1 − 2γ) ‖AB‖_2.
- So ‖AB‖_2 ≤ ‖CD‖_2 / √(1 − 2γ) = ‖w∗_i − w_i‖_2 / √(1 − 2γ).

(Uses ‖AE‖_2 / ‖AB‖_2 ≥ √(1 − (γ / (1 − γ))²) from the previous step.)

[Figure: same construction — points O, A, B, C, D, E, F, G, H.]



Page 79: Central Lemma: geometric lemma

- |⟨OA − OB, OB⟩| = ‖BE‖_2, where OA, OB are the unit vectors of e_i + w∗_i and e_i + w_i (Page 60).
- △ABE ∼ △GBO.
- ‖BE‖_2 / ‖AB‖_2 = ‖GB‖_2 / ‖BO‖_2 = ‖AB‖_2 / 2.

Hence

|⟨OA − OB, OB⟩| = ‖AB‖_2² / 2 ≤ ‖w∗_i − w_i‖_2² / (2(1 − 2γ)).

This is a very tight bound: O(γ²).

(Uses ‖w_i − w∗_i‖_2 ≤ 2γ and ‖AB‖_2 ≤ ‖w∗_i − w_i‖_2 / √(1 − 2γ).)

[Figure: same construction — points O, A, B, C, D, E, F, G, H.]
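A quick numerical check of the final bound (my own sketch; it reads OA and OB as the unit vectors of e_i + w∗_i and e_i + w_i, with hypothetical d and γ):

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma = 20, 0.2
e_i = np.eye(d)[:, 0]

def random_in_ball(radius):
    v = rng.standard_normal(d)
    return radius * rng.random() * v / np.linalg.norm(v)   # a vector with norm <= radius

w, w_star = random_in_ball(gamma), random_in_ball(gamma)
a = (e_i + w_star) / np.linalg.norm(e_i + w_star)          # OA
b = (e_i + w) / np.linalg.norm(e_i + w)                    # OB

lhs = abs((a - b) @ b)
print(lhs, np.linalg.norm(a - b) ** 2 / 2)                             # identity: |<OA-OB, OB>| = ||AB||^2 / 2
print(lhs <= np.linalg.norm(w_star - w) ** 2 / (2 * (1 - 2 * gamma)))  # the O(gamma^2) bound
```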

Page 80: Is it enough..?

With the tight bounds, we get: if ‖W_0‖_2, ‖W∗‖_2 ≤ γ = Ω(1), then

⟨−∇L(W), W∗ − W⟩ > ( 0.084 − (1 + γ) g / (2(1 − 2γ)) ) ‖W∗ − W‖²_F

- g is a potential function:
  g ≜ Σ_{i=1}^{d} (‖e_i + w∗_i‖_2 − ‖e_i + w_i‖_2)
- d is the input dimension.
- How does g affect one-point convexity (OPC)?
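The potential g is easy to compute; a one-function sketch (column convention as on the slide):

```python
import numpy as np

def potential_g(W, W_star):
    """g = sum_i (||e_i + w*_i||_2 - ||e_i + w_i||_2), where w_i are the columns of W."""
    I = np.eye(W.shape[0])
    col_norms = lambda M: np.linalg.norm(I + M, axis=0)    # norms of the columns e_i + m_i
    return float((col_norms(W_star) - col_norms(W)).sum())
```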



Page 85: Ask Simulation: what if g is large? [stuck]

Simulation: [plot]


Page 87: Ask Simulation: will g always decrease with SGD?

Simulation: yes (for lots of instances).

Page 88: g controls the dynamics!

[Plot: two panels, PI (Phase I) and PII (Phase II).]

Page 89: The actual dynamics

[Figure: trajectory W_1 → W_6 → W_10 relative to W∗.]

Phase I: W_1 → W_6, W may go in the wrong direction. Phase II: W_6 → W_10, W gets closer to W∗ in every step by one-point convexity.

Page 90: Two-phase framework

- Phase I:
  - g decreases to a small value.
  - Technique: analyze the dynamics of SGD.
- Phase II: the one-point convex region.
  - Get closer to W∗ after every step.
  - g stays small.
  - Technique: compute the inner product.


Page 95: Phase I: g keeps decreasing

- g ≜ Σ_{i=1}^{d} (‖e_i + w∗_i‖_2 − ‖e_i + w_i‖_2).

What is ∆g here?

∆g ≈ Σ_i ⟨−∆w_i, e_i + w_i⟩ = ⟨η∇L(W), I + W⟩ ≈ ⟨η∇L(W), I⟩ = η Tr(∇L(W))

What is Tr(∇L(W))?

Tr(∇L(W)) ≈ O(Tr(W∗ − W)) + O(Tr((W∗ − W)uu^T)) + dg

[Figure: vector e_i + w_i from the origin O, a perturbation ∆w_i, and the resulting change ∆‖e_i + w_i‖_2.]



Page 99: Phase I: g keeps decreasing

Observation:

Tr(W∗ − W) = Σ_{i=1}^{d} (1 + w∗_{i,i} − 1 − w_{i,i}) ≈ Σ_{i=1}^{d} (‖e_i + w∗_i‖_2 − ‖e_i + w_i‖_2) = g

But Tr((W∗ − W)uu^T) is hard to bound. Therefore, we consider the joint updating rule of s ≜ (W∗ − W)u and g (the numbers below are for illustration only):

‖s_{t+1}‖_2 ≈ 0.9 ‖s_t‖_2 + 10η |g_t|
|g_{t+1}| ≈ 0.9 |g_t| + 10η ‖s_t‖_2

When 10η < 0.05, ‖s_{t+1}‖_2 + |g_{t+1}| ≤ 0.95 (‖s_t‖_2 + |g_t|), so |g_t| becomes very small.
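Iterating the illustrative recursion shows the claimed contraction; a tiny sketch (the 0.9 / 10η coefficients are the slide's illustration-only numbers, starting values hypothetical):

```python
eta = 0.004                  # so 10 * eta = 0.04 < 0.05
s, g = 1.0, 1.0              # hypothetical starting values for ||s_t||_2 and |g_t|
for t in range(10):
    s, g = 0.9 * s + 10 * eta * g, 0.9 * g + 10 * eta * s   # joint update, applied simultaneously
    print(t, round(s, 4), round(g, 4), round(s + g, 4))     # s + g contracts by exactly 0.94 per step here
```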


Page 101: Main result

Main Theorem (informal). If the input is Gaussian, ‖W_0‖_2, ‖W∗‖_2 ≤ γ (a constant), and the step size is small, then mini-batch SGD started from W_0 reaches W∗ after a polynomial number of steps, in two phases.

This matches standard O(1/√d) initialization schemes (d is the input dimension).

Page 102: Open questions

- Multiple layers?
- Other input distributions?
- Convolutional networks?
- Residual links that skip two layers?
- Identify different potential functions for other non-convex problems?

Pages 103–109: References

Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. (2014). Learning polynomials with neural networks. In ICML, pages 1908–1916.

Arora, S., Bhaskara, A., Ge, R., and Ma, T. (2014). Provable bounds for learning some deep representations. In ICML 2014, pages 584–592.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3):930–945.

Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a ConvNet with Gaussian inputs. In ICML 2017.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2016). Entropy-SGD: Biasing gradient descent into wide valleys. ArXiv e-prints.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.

Cybenko, G. (1992). Approximation by superpositions of a sigmoidal function. MCSS, 5(4):455.

Eldan, R. and Shamir, O. (2015). The power of depth for feedforward neural networks. ArXiv e-prints.

Goel, S., Kanade, V., Klivans, A. R., and Thaler, J. (2016). Reliably learning the ReLU in polynomial time. CoRR, abs/1611.10258.

Goel, S. and Klivans, A. (2017). Eigenvalue decay implies polynomial-time learnability for neural networks. In NIPS 2017.

Hardt, M. and Ma, T. (2016). Identity matters in deep learning. CoRR, abs/1611.04231.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages 529–536. MIT Press.

Hornik, K., Stinchcombe, M. B., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Janzamin, M., Sedghi, H., and Anandkumar, A. (2015). Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473.

Kawaguchi, K. (2016). Deep learning without poor local minima. In NIPS, pages 586–594.

Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. (2017). On the ability of neural nets to express distributions. ArXiv e-prints.

Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In NIPS 2017.

Livni, R., Shalev-Shwartz, S., and Shamir, O. (2014). On the computational efficiency of training neural networks. In NIPS, pages 855–863.

Safran, I. and Shamir, O. (2016). Depth-width tradeoffs in approximating natural functions with neural networks. ArXiv e-prints.

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120.

Sedghi, H. and Anandkumar, A. (2015). Provable methods for training neural networks with sparse connectivity. ICLR.

Shamir, O. (2016). Distribution-specific hardness of learning neural networks. CoRR, abs/1609.01037.

Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. ArXiv e-prints.

Šíma, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728.

Tian, Y. (2016). Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. Submitted to ICLR 2017.

Xie, B., Liang, Y., and Song, L. (2017). Diversity leads to generalization in neural networks. In AISTATS.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. ArXiv e-prints.

Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, M. I. (2015). Learning halfspaces and neural networks with random initialization. CoRR, abs/1511.07948.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In ICML 2017.