understanding deep learning requires rethinking generalization

IDS Lab

Understanding deep learning requires rethinking generalization

Is deep learning really doing any generalization?

Presented by Jamie Seol


Jamie Seol

Motivation
• Normally, we measure generalization by:
• generalization error = |training error - test error|
• if we overfit, the training error should be low while the test error becomes large = high generalization error!
• However, a complex neural network is prone to overfitting!
• for example, let’s train a human baby on a randomly labeled CIFAR-10 dataset
• then show them some samples from the training set (2nd+ epoch)
• they will say "what the…" to any question
• because it’s impossible to generalize any kind of abstract concept from random labels!
• what about a neural network?


CIFAR-10
• This is the CIFAR-10 dataset
• The goal of this task is to classify a given image into one of 10 classes
• The CNNs that we know well solve this rather easily


Randomized CIFAR-10
• When we randomize the information in CIFAR-10’s training set, the resulting accuracy becomes:


Randomized CIFAR-10
• This is nothing more than over-overfitting!
• What’s the problem, then?
• neural networks memorized the dataset
• even though it should have no meaning!
• it’s random! raaaaandddddddommm!!!
• aaaaarrrrrrrr!!!
• it did not generalize any concept
• it just memorized!!!!
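The label randomization behind this slide can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `randomize_labels`, its `corruption` parameter, and the synthetic stand-in labels are all assumptions made for the example.

```python
import numpy as np

def randomize_labels(labels, num_classes, corruption=1.0, seed=0):
    """Replace a fraction `corruption` of labels with uniformly random classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    # Pick which examples get a fresh label, independently with prob. `corruption`.
    mask = rng.random(labels.shape[0]) < corruption
    # Draw replacements uniformly; a replacement may coincide with the true label.
    labels[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
    return labels

# Stand-in for the 50,000 CIFAR-10 training labels (synthetic, not the real set).
true_labels = np.random.default_rng(1).integers(0, 10, size=50_000)
random_labels = randomize_labels(true_labels, num_classes=10, corruption=1.0)
agreement = float(np.mean(random_labels == true_labels))
print(f"fraction of labels unchanged by chance: {agreement:.3f}")
```

With full corruption, roughly one label in ten still matches the original by chance, so the labels carry no information about the images.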


Randomized CIFAR-10
• Even if you didn’t intend it, neural nets can just memorize things rather than generalize!
• According to the experiment,
• the effective capacity of neural networks is sufficient for memorizing the entire data set
• randomizing (corrupting) the data set makes the task harder by only a small constant factor compared to the original task!
• Again, even if you didn’t want it to!! a neural network naturally tends to overfit!!
• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur, from the movie "3 Idiots"
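A toy version of the memorization effect can be reproduced without a CNN. The sketch below is an assumption-laden stand-in, not the paper's experiment: it uses pure-noise inputs, a fixed random ReLU hidden layer, and a least-squares output layer instead of training a deep network with SGD. All sizes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, width, num_classes = 200, 200, 32, 1000, 10

# Pure-noise inputs and labels: there is nothing to learn, only to memorize.
x_train = rng.normal(size=(n_train, d))
x_test = rng.normal(size=(n_test, d))
y_train = rng.integers(0, num_classes, size=n_train)
y_test = rng.integers(0, num_classes, size=n_test)

# Fixed random hidden layer with ReLU; only the output layer is fitted.
w_hidden = rng.normal(size=(d, width)) / np.sqrt(d)

def features(x):
    return np.maximum(x @ w_hidden, 0.0)

# One-hot targets; with width > n_train, least squares can interpolate them.
targets = np.eye(num_classes)[y_train]
w_out, *_ = np.linalg.lstsq(features(x_train), targets, rcond=None)

def accuracy(x, y):
    return float(np.mean(np.argmax(features(x) @ w_out, axis=1) == y))

train_acc = accuracy(x_train, y_train)
test_acc = accuracy(x_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The model fits every random training label while staying near chance level on fresh random data, which is the gap between memorizing and generalizing that the slide describes.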


Regularization
• However, we do know that there are a lot of regularization techniques that support generalization!
• dropout, batch norm, early stopping, weight decay…
• They do seem to help, but wait…
• can someone prove that regularization fundamentally improves generalization?
• does this really, really work well? really???


Regularization
• Isn’t data augmentation significantly more important than weight decay?
• Even with regularization, neural networks are good memorizers
• Just changing the model increased test accuracy


Regularization
• Early stopping helps

• but not necessarily…


Regularization
• Well… these techniques do seem helpful, but suspicion remains…


Rademacher complexity
• By the way, what’s the big deal about memorizing everything?
• The following measurement is called Rademacher complexity
• Detailed math is omitted here
• The thing is, if a model can memorize everything (actually, if the hypothesis class has the power to fit a randomized dataset), then the theoretical upper bound on the generalization error is just 1
• which is useless!!!!
• actually, using a regularization scheme lowers the bound, but this does not hold for ReLU, and we’ll show that there are situations where regularization doesn’t help at all
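The formula this slide points to did not survive in the transcript. For reference, the standard definition of empirical Rademacher complexity for a hypothesis class H on a sample x_1, …, x_n is given below; the notation is the usual textbook one and is assumed here, not copied from the slide.

```latex
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
\qquad \sigma_1, \dots, \sigma_n \in \{\pm 1\} \text{ i.i.d. uniform}
```

If H takes values in [-1, 1] and can fit every random ±1 labeling of the sample, the supremum equals 1 for each draw of σ. The complexity is then 1, and the bound it gives on generalization error is trivial.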


Finite-sample expressivity
• Remember the Universal Approximation Theorem (UAT)?
• the finite-sample expressivity theorem is a more practical version of it
• note that this statement shows that the UAT does not guarantee generalization!
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of size n in d dimensions
• This is not a hard theorem to prove, so let’s do it


Lemma 1
• Lemma 1: for b1 < x1 < b2 < … < bn < xn, the matrix A = [ReLU(xi - bj)]ij has full rank
• Proof: obvious (xi > bj exactly when j ≤ i, so A is lower triangular with a positive diagonal)
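Lemma 1 is easy to check numerically. The sketch below builds A for one arbitrary interleaved choice of b and x (the specific numbers are an assumption for illustration) and confirms the triangular structure and the rank.

```python
import numpy as np

def relu_matrix(x, b):
    """A[i, j] = ReLU(x[i] - b[j])."""
    return np.maximum(x[:, None] - b[None, :], 0.0)

# Interleaved values b1 < x1 < b2 < x2 < ... < bn < xn (arbitrary example numbers).
n = 5
b = np.arange(n, dtype=float)   # 0, 1, 2, 3, 4
x = b + 0.5                     # 0.5, 1.5, 2.5, 3.5, 4.5
A = relu_matrix(x, b)

# x[i] > b[j] exactly when j <= i, so A is lower triangular with a positive diagonal.
is_lower_triangular = bool(np.allclose(A, np.tril(A)))
diagonal_positive = bool(np.all(np.diag(A) > 0))
rank = int(np.linalg.matrix_rank(A))
print(A)
print(is_lower_triangular, diagonal_positive, rank)
```

A triangular matrix with a nonzero diagonal has a nonzero determinant, which is why the slide can call the proof obvious.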


Theorem 1
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of size n in d dimensions
• Proof: Note that a 2-layer neural network with ReLU can be expressed as
• NN2(z) = Σj wj ReLU(⟨a, z⟩ - bj)
• where w, b ∈ ℝn and a ∈ ℝd
• for data S = {z1, …, zn} and labels y ∈ ℝn where zi ∈ ℝd, we want to show (WTS) yi = NN2(zi) for all i from 1 to n
• choose a, b so that xi = ⟨a, zi⟩ meets the condition of Lemma 1 (possible for distinct zi, after reordering the samples)
• Then this becomes y = Aw, and Lemma 1 says that A is invertible
• done
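The proof is constructive, so it can be run. The sketch below follows the same steps under a few assumptions that are not in the slides: the direction a is drawn at random (which gives distinct projections almost surely), the thresholds b are placed at midpoints between sorted projections, and the names `fit_two_layer_relu` and `nn2` are made up for the example.

```python
import numpy as np

def fit_two_layer_relu(z, y, seed=0):
    """Return (a, b, w) such that sum_j w[j] * ReLU(<a, z_i> - b[j]) = y[i]."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=z.shape[1])      # generic direction: projections are distinct
    x = z @ a
    order = np.argsort(x)
    xs = x[order]
    # Thresholds interleaved with the sorted projections: b1 < x1 < b2 < ... < bn < xn.
    b = np.empty_like(xs)
    b[0] = xs[0] - 1.0
    b[1:] = (xs[:-1] + xs[1:]) / 2.0
    A = np.maximum(xs[:, None] - b[None, :], 0.0)   # lower triangular, invertible
    w = np.linalg.solve(A, y[order])                # solve y = A w
    return a, b, w

def nn2(z, a, b, w):
    """Evaluate the 2-layer ReLU network on each row of z."""
    return np.maximum((z @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(1)
n, d = 50, 8
z = rng.normal(size=(n, d))   # synthetic sample, for illustration only
y = rng.normal(size=n)        # arbitrary real-valued labels
a, b, w = fit_two_layer_relu(z, y)
max_error = float(np.max(np.abs(nn2(z, a, b, w) - y)))
num_weights = a.size + b.size + w.size
print(max_error, num_weights)
```

The network reproduces all n labels up to floating-point error, and the weight count is d for a plus n each for b and w, matching the 2n+d in the theorem.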


Finite-sample expressivity
• What does it mean?
• It means that once you have more than about 2n + d parameters, your model already has enough power to super-overfit and just remember everything instead of generalizing any concept; it therefore gets only a trivial bound on generalization error and is in danger of being nothing more than a memorizer
• long story short: we can’t speak formally about generalization in deep learning yet
• a side note (a snake’s leg): for a deeper network, use the intermediate layers to choose split intervals rather than the target, resulting in a similar O(n + k) number of required parameters


Stochastic Gradient Descent
• Let’s think about linear optimization
• If d is large, the problem is underdetermined, and we can have multiple global minima
• But hey, can we determine which optimum gives the best generalization?
• in non-linear systems, peeking at the curvature helped
• but in a linear system the curvature is the same at every optimum, so it can’t tell them apart!


Stochastic Gradient Descent
• Funny thing about SGD is that, for an underdetermined system with l2 loss, it converges to the minimum l2-norm solution, so it is known to act as a regularizer itself
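The implicit regularization can be seen on a small linear problem. The sketch below is a simplification and says so: it uses full-batch gradient descent rather than SGD, starts from w = 0, and runs on synthetic Gaussian data. The argument is the same in both cases, since every update is a combination of rows of X and the iterate therefore never leaves the row space.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                  # fewer equations than unknowns: underdetermined
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||Xw - y||^2, started from w = 0.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)        # each update lies in the row space of X

# Minimum l2-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another exact solution: add a null-space direction, which only increases the norm.
v = rng.normal(size=d)
null_dir = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_dir

print(np.linalg.norm(X @ w - y), np.linalg.norm(w - w_min_norm))
print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))
```

Both `w_min_norm` and `w_other` fit the training data exactly, yet gradient descent lands on the first one. Which global minimum you reach depends on the optimizer, not only on the loss.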


Stochastic Gradient Descent
• However… the results show that the minimum l2-norm solution wasn’t always the best optimum in the sense of generalization
• furthermore, it is possible to construct a dataset for which the minimum l2-norm solution is not optimal! a constructive counterexample!
• adding l2 regularization to the parameters didn’t help a bit (not shown in the table)

[Table annotations: norm = 220; norm = 390]


Conclusion
• "Be careful whenever you speak of 'generalization' in deep learning"
• Contributions of this paper:
• an experimental framework for scrutinizing the suspicious behavior of generalization techniques
• a proof that theoretical bounds on generalization error are lacking in deep learning (since a network can just memorize it all with small effective capacity)
• optimization does not necessarily mean generalization
• "beware of the light" - Caliban, from the movie "Logan"


References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22
